ABBA Neural Networks: Coping with Positivity, Expressivity, and Robustness


Ana Neacsu, Jean-Christophe Pesquet, Vlad Vasilescu, and Corneliu Burileanu


Abstract. We introduce ABBA networks, a novel class of (almost) non-negative neural networks, which are shown to 
possess a series of appealing properties. In particular, we demonstrate that these networks are universal 
approximators while enjoying the advantages of non-negative weighted networks. We derive tight 
Lipschitz bounds both in the fully connected and convolutional cases. We propose a strategy for designing 
ABBA nets that are robust against adversarial attacks, by finely controlling the Lipschitz constant of the 
network during the training phase. We show that our method outperforms other state-of-the-art defenses 
against adversarial white-box attackers. Experiments are performed on image classification tasks on four 
benchmark datasets. 
1. Introduction. Deep learning methods based on neural network models have received increasing
attention in the scientific community, because of their stunning abilities to solve a variety of
complex tasks. These powerful systems excel at learning intricate mappings and, in some cases, even 
surpass human performance. However, deep architectures usually lack interpretability and they may 
lead to over-parameterized models [13, 37]. Additionally, their robustness is not well-controlled, 
leaving them exposed to potential adversarial attacks. For instance, [49] demonstrated that by 
introducing carefully-crafted, low-magnitude adversarial perturbations, neural classifiers could be 
easily fooled [17]. A way of overcoming the aforementioned challenges consists in introducing some 
specific constraints in the neural network design. In this article, we are interested in nonnegativity 
and stability constraints on the network weights. 

It is widely accepted that humans possess the innate ability to decompose complex interactions 
into discrete, intuitive hierarchical categories before analyzing them [26]. Conceptually, this evolution 
towards part-based representation in human cognition can be linked to non-negativity restrictions on 
the network weights [7]. This idea, along with other factors, has sparked interest in neural networks 
with non-negative weights. These networks have drawn attention for several reasons. Firstly, they 
align with human understandability, making them more interpretable. Secondly, the non-negativity 
constraint can act as beneficial regularization, effectively reducing overfitting issues. Moreover, recent 
studies have demonstrated that it is possible to derive a tight Lipschitz bound for such networks. 
This Lipschitz constant serves as a valuable metric for quantifying the robustness of the network,
enabling us to design networks with enhanced resilience to adversarial perturbations during the
training process. Despite their advantages, one significant drawback of networks with non-negative
weights is that they might be less expressive than networks with arbitrary signed weights. In [54], it
is shown that standard non-negative networks are not universal approximators, a limitation that
our work overcomes.

Approach. In this work, we are interested in neural networks having non-negative weights, 
except for the first and last linear layers. This class of networks obviously constitutes an extension of 
those having all their linear layers non-negative-valued. We focus on a particular subclass of these 
networks for which the weight matrices have a structure of the form

         $\begin{bmatrix} A & B \\ B & A \end{bmatrix},$

thus enjoying a number of algebraic properties. The corresponding networks are subsequently called 
ABBA networks. Note that weight matrices A and B are duplicated in ABBA networks, thus allowing 
us to limit the number of parameters. 

Contributions. This paper makes several key contributions, which are as follows: 

• We show that we can put any arbitrary signed network in an ABBA form. This property holds
  for fully connected as well as for convolutional neural networks.

• Universal approximation theorems are derived for networks featuring non-negatively weighted
  layers.

• We present a method for effectively controlling the Lipschitz constant of ABBA networks¹. The
  resulting training strategy applies to both fully connected and convolutional cases. Precise
  Lipschitz bounds are typically NP-hard to compute for arbitrary signed networks, but our
  framework allows us to derive such bounds that are easy to compute.

• Numerical experiments conducted on standard image datasets showcase the excellent performance
  of ABBA networks for small models. Notably, they exhibit substantial improvements
  in both performance and robustness compared to networks with exclusively non-negative
  weights. Moreover, we demonstrate that ABBA networks are competitive with robust networks
  featuring arbitrarily signed weights, trained using state-of-the-art techniques.

Outline. The rest of the paper is organized as follows. Section 2 offers an overview of the related 
literature, while in Section 3 our main contributions concerning ABBA architectures are introduced, 
alongside a list of fundamental properties. Section 4 extends our results to the case of convolutional 
neural networks and two Lipschitz constant expressions are derived. Section 5 describes the training 
strategy we employ to generate robust models with respect to adversarial perturbations. Section 6 
details the results obtained for different image classification tasks, while Section 7 is dedicated to 
concluding remarks. 


2. Related work. Non-negative neural networks. Inspired by non-negative matrix factorization 
(NMF) techniques, the work of [7] introduces non-negative restrictions on the weights to create neural 
networks in which the hidden units correspond to identifiable concepts. [4] showed that Autoencoders
(AE) trained under non-negativity constraints are able to derive meaningful representations that
unearth the hidden structure of high-dimensional data. Their method showed promising results from
both performance and feature interpretation viewpoints on four different classification tasks. [12]
presented the first polynomial-time algorithm for Probably Approximately Correct (PAC) learning
1-layer neural networks with positive coefficients. Moreover, ensuring non-negativity has been shown to
have a regularization effect, reducing feature overfitting, which is a very common problem, especially
for tasks where the available training data is scarce [35]. Neural networks defining convex functions
of their inputs [1] also constitute a subclass of networks with non-negative weights.

¹ A full PyTorch implementation of our framework will be made available at
https://github.com/Vladimirescu/ABBA-Neural-Networks-torch.

Link with other networks. From another perspective, the idea of using redundant weights is 
reminiscent of siamese networks [6]. These architectures are successfully used to handle similarity 
learning tasks, such as face verification [50], character recognition [23], and object tracking [19]. 
Siamese networks compute a similarity metric on the representations of the inputs, after applying the 
same transformation to each one. Apart from the proven efficiency on solving computer vision tasks, 
they have lately been employed in NLP problems, e.g., computational argumentation. In [14], it is 
shown that siamese architectures outperform other baselines trained on convincingness datasets. 

Robustness. The robustness of neural networks against possible adversarial attacks is a topic that
has received increasing attention since nowadays AI-based solutions are ubiquitous [3, 36]. A sizable
body of literature on adversarial attacks and different defense strategies has emerged in recent
years as a result of the work in [49], which revealed the alluring susceptibility of neural networks to
adversarial perturbations and proposed a box-constrained L-BFGS algorithm for finding adversarial
examples. [15] introduced the FGSM attack as a one-step modification of the input image, following 
the direction of loss maximization, while [24] incorporated this step into an iterative method known 
as PGD, seen as an improvement over basic FGSM. DeepFool [33] iteratively searches for the closest 
adversarial point that directs the optimization towards crossing the decision boundary. DDN [42] and 
FMN [39] attacks fall into the category of projected-gradient methods, using iterative updates of the 
perturbation vector towards the minimization of its magnitude. 

Defensive strategies have been developed to alleviate this robustness issue. [47] divides adversarial
defense methods into three categories: adversarial detection, gradient masking, and robust
optimization. Adversarial Training (AT) was first introduced by [15] and later improved by [32].
Recent works on AT [48, 55] have successfully analyzed and refined training techniques. However,
no theoretical certificates regarding their behavior in the presence of different adversaries have
been established yet. Regularization-based methods, such as [30, 46, 59], include additional terms
in their objective, steering the learning process in a direction that leads to better generalization.
[40] provides robustness certificates for neural networks with one hidden layer, yielding an upper
bound of the error in the presence of any adversary (see [11, 16, 41] for more advanced methods).
Randomized smoothing [9, 25, 29, 44, 57] certifies the robustness of a classifier around an input
point by measuring the most-likely prediction over Gaussian-corrupted versions of the point.

Lipschitz properties of neural networks. As highlighted by [49], the Lipschitz behavior of a 
neural model is closely correlated with its robustness against adversarial attacks, providing an upper 
bound on the response given input perturbations. Controlling the Lipschitz behavior of the network 
thus offers theoretical stability guarantees. However, computing the exact constant, even for small
networks, is an NP-hard problem [20], and finding a good approximation in a reasonable time is
an open challenge. Several solutions have been proposed lately (see for example [5, 18, 21, 58]).
[45] introduced deel-lip, a framework to control the Lipschitz constant of each layer individually,
while [2] proposed GroupSort networks to ensure robustness. [38] proposed a framework for training
fully-connected neural networks using Lipschitz regularized and constrained techniques, proving their
effectiveness in the scenario of Gaussian-added perturbation noise. A recent result in [10] showed
that in the case of models with non-negative weights a tight Lipschitz bound can be established,
making possible the training of neural network models with certified robustness guarantees.


3. ABBA Neural Networks. 


3.1. Problem formulation. In the remainder of this paper, $\|\cdot\|$ will denote the $\ell_2$-norm when
dealing with a vector, and the spectral norm when dealing with a matrix.
An m-layer feedforward neural network can be described by the following model. 


Model 3.1. T is a feedforward neural network if there exists $(N_i)_{0\le i\le m} \in (\mathbb{N}\setminus\{0\})^{m+1}$ such that

(3.1)    $T = T_m \circ \cdots \circ T_1$

where, for every layer index $i \in \{1,\ldots,m\}$, $T_i = R_i(W_i \cdot + b_i)$, $W_i \in \mathbb{R}^{N_i\times N_{i-1}}$ is the weight matrix,
$b_i \in \mathbb{R}^{N_i}$ the bias vector, and $R_i\colon \mathbb{R}^{N_i} \to \mathbb{R}^{N_i}$ the activation operator. $N_{i-1}$ corresponds to the number
of inputs at the $i$-th layer and $N_i$ to its number of outputs. Such a layer is convolutive if it corresponds to a weight matrix $W_i$ having
some Toeplitz (or block Toeplitz) structure.

We will say that the activation operator $R_i$ is symmetric if there exists $(c_i, d_i) \in (\mathbb{R}^{N_i})^2$ such that

(3.2)    $(\forall x \in \mathbb{R}^{N_i})\quad R_i(x) - d_i = -R_i(-x + c_i).$

In other words, $(c_i, d_i)/2$ is a symmetry center of the graph of $R_i$.
For example, if $R_i$ is the squashing function used in CapsNets [43], it is such that

(3.3)    $(\forall x \in \mathbb{R}^{N_i})\quad R_i(x) = \dfrac{\mu \|x\|}{1 + \|x\|^2}\, x,$

with $\mu = 8/(3\sqrt{3})$. It thus satisfies the symmetry property (3.2) with $c_i = d_i = 0$. In addition, $R_i$
is nonexpansive, i.e. it has a Lipschitz constant equal to 1 [10]. Other examples of symmetric and
nonexpansive activation operators are presented in Appendix SM1².


3.2. ABBA Matrices. We first define ABBA matrices which will be the main algebraic tool 
throughout this article. 


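Definition 3.2. Let $(N_1, N_2) \in (\mathbb{N}\setminus\{0\})^2$. $\mathcal{A}_{N_1,N_2}$ is the space of ABBA matrices of size $(2N_2)\times(2N_1)$,
that is $M \in \mathcal{A}_{N_1,N_2}$ if there exist matrices $A \in \mathbb{R}^{N_2\times N_1}$ and $B \in \mathbb{R}^{N_2\times N_1}$ such that

(3.4)    $M = \begin{bmatrix} A & B \\ B & A \end{bmatrix}.$

The sum matrix associated with $M$ is then defined as $\mathfrak{S}(M) = A + B$.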


We give some of the most relevant properties of these matrices. In particular, we will see that the 
ABBA structure is stable under standard matrix operations. 


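Proposition 3.3. Let $(N_1, N_2, N_3) \in (\mathbb{N}\setminus\{0\})^3$.

(i). If $M \in \mathcal{A}_{N_1,N_2}$, then its transpose $M^\top \in \mathcal{A}_{N_2,N_1}$ and $\mathfrak{S}(M^\top) = \mathfrak{S}(M)^\top$.

(ii). If $(M_1, M_2) \in (\mathcal{A}_{N_1,N_2})^2$, then $M_1 + M_2 \in \mathcal{A}_{N_1,N_2}$ and $\mathfrak{S}(M_1 + M_2) = \mathfrak{S}(M_1) + \mathfrak{S}(M_2)$.

(iii). If $M_1 \in \mathcal{A}_{N_1,N_2}$ and $M_2 \in \mathcal{A}_{N_2,N_3}$, then $M_2 M_1 \in \mathcal{A}_{N_1,N_3}$ and
       $\mathfrak{S}(M_2 M_1) = \mathfrak{S}(M_2)\,\mathfrak{S}(M_1)$.

(iv). $\mathcal{A}_{N_1,N_1}$ is a ring when equipped with the standard matrix addition and product.

(v). If $A$ and $B$ are two square matrices of the same size, the eigenvalues of $\begin{bmatrix} A & B \\ B & A \end{bmatrix}$ are those of
     $A + B$ and $A - B$.

(vi). If $A$ and $B$ are two matrices having the same dimensions, the spectral norm of $\begin{bmatrix} A & B \\ B & A \end{bmatrix}$ is equal
      to $\max\{\|A + B\|, \|A - B\|\}$.

(vii). If $M \in \mathcal{A}_{N_1,N_2}$ has non-negative elements, the spectral norm of $M$ is $\|\mathfrak{S}(M)\|$.

(viii). Let $A \in \mathbb{R}^{N_2\times N_1}$ and $B \in \mathbb{R}^{N_2\times N_1}$, and let $K = \min\{N_1, N_2\}$. Let $(\lambda_k)_{1\le k\le K}$ (resp. $(\mu_k)_{1\le k\le K}$)
        be the singular values of $A + B$ (resp. $A - B$) and let $\{u_k\}_{1\le k\le K}$ / $\{v_k\}_{1\le k\le K}$ (resp. $\{t_k\}_{1\le k\le K}$
        / $\{w_k\}_{1\le k\le K}$) be associated orthonormal families of left/right singular vectors in $\mathbb{R}^{N_2}$ / $\mathbb{R}^{N_1}$.³
        Then, the singular values of $\begin{bmatrix} A & B \\ B & A \end{bmatrix}$ are $(\lambda_k, \mu_k)_{1\le k\le K}$ and associated orthonormal families of
        left/right singular vectors are

        $\left\{ \dfrac{1}{\sqrt{2}}\begin{bmatrix} u_k \\ u_k \end{bmatrix}, \dfrac{1}{\sqrt{2}}\begin{bmatrix} t_k \\ -t_k \end{bmatrix} \right\}_{1\le k\le K} \Big/ \left\{ \dfrac{1}{\sqrt{2}}\begin{bmatrix} v_k \\ v_k \end{bmatrix}, \dfrac{1}{\sqrt{2}}\begin{bmatrix} w_k \\ -w_k \end{bmatrix} \right\}_{1\le k\le K}.$

(ix). If $A$ and $B$ are two matrices having the same dimensions,

(3.5)    $\operatorname{rank}\begin{bmatrix} A & B \\ B & A \end{bmatrix} = \operatorname{rank}(A + B) + \operatorname{rank}(A - B).$

(x). Let $f$ be a function from $\mathbb{R}^{(2N_2)\times(2N_1)}$ to $\mathbb{R}^{(2N_2)\times(2N_1)}$. Assume that either $f$ operates elementwise
     or it is a spectral function in the sense that there exists a function $\varphi\colon \mathbb{R}_+ \to \mathbb{R}_+$ such that

(3.6)    $(\forall M \in \mathbb{R}^{(2N_2)\times(2N_1)})\quad f(M) = \sum_{k=1}^{2K} \varphi(\widetilde{\lambda}_k)\, \widetilde{u}_k \widetilde{v}_k^\top,$

     where $K = \min\{N_1, N_2\}$, $(\widetilde{\lambda}_k)_{1\le k\le 2K}$ are the singular values of $M$, and $\{\widetilde{u}_k\}_{1\le k\le 2K}$ /
     $\{\widetilde{v}_k\}_{1\le k\le 2K}$ are associated orthonormal families of left / right singular vectors in $\mathbb{R}^{2N_2}$ /
     $\mathbb{R}^{2N_1}$. Then $f$ maps any matrix in $\mathcal{A}_{N_1,N_2}$ to a matrix in $\mathcal{A}_{N_1,N_2}$.

(xi). The best approximation of maximum rank $R < \min\{N_1, N_2\}$ (in the sense of the Frobenius
      norm) to a matrix in $\mathcal{A}_{N_1,N_2}$ belongs to $\mathcal{A}_{N_1,N_2}$.

(xii). The projection onto the spectral ball of center 0 and radius $\rho \in \,]0, +\infty[$ of an ABBA matrix is an
       ABBA matrix.


The proofs of these properties are provided in Appendix SM2.


² Appendices with number of the form SMx can be found in the supplementary materials.
³ This means that (SM2.11) and (SM2.12) hold.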
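
As a quick numerical sanity check of Proposition 3.3, the following NumPy sketch builds random ABBA matrices and tests items (iii), (vi), and (vii). The helper names `abba` and `spectral_norm`, the matrix sizes, and the random seed are illustrative choices made for this sketch only; they are not taken from the authors' implementation.

```python
import numpy as np

def abba(A, B):
    """Assemble the ABBA matrix [[A, B], [B, A]] from two same-sized blocks."""
    return np.block([[A, B], [B, A]])

def spectral_norm(M):
    """Largest singular value of M."""
    return np.linalg.norm(M, 2)

rng = np.random.default_rng(0)
N1, N2, N3 = 4, 3, 5
A1, B1 = rng.standard_normal((N2, N1)), rng.standard_normal((N2, N1))
A2, B2 = rng.standard_normal((N3, N2)), rng.standard_normal((N3, N2))
M1, M2 = abba(A1, B1), abba(A2, B2)

# Item (iii): the product is again ABBA and the sum matrix is multiplicative.
P = M2 @ M1
A3, B3 = P[:N3, :N1], P[:N3, N1:]
product_is_abba = np.allclose(P, abba(A3, B3))
sum_is_multiplicative = np.allclose(A3 + B3, (A2 + B2) @ (A1 + B1))

# Item (vi): the spectral norm is max(||A + B||, ||A - B||).
norm_vi = max(spectral_norm(A1 + B1), spectral_norm(A1 - B1))

# Item (vii): with non-negative entries, the norm is that of the sum matrix.
M_pos = abba(np.abs(A1), np.abs(B1))
norm_vii = spectral_norm(np.abs(A1) + np.abs(B1))
```

In this sketch the norm of a non-negative ABBA matrix of size (2N_2) x (2N_1) is obtained from a single N_2 x N_1 matrix, which is what makes the bounds of the later sections inexpensive to evaluate.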
3.3. Extension to feedforward networks. We will now extend the previous algebraic concepts
by introducing the class of ABBA feedforward neural networks. In the following, the structure of an 
ABBA fully-connected network will be presented from the perspective of investigating its links with 
standard networks. Such networks make use of weights that respect the structure of ABBA matrices, 
except for the first and the last layers. More precisely, the first layer maps the input to a twice-higher 
dimensional space, while the last layer performs a dimension reduction by a factor of 2. 


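Definition 3.4. Let $m \in \mathbb{N}\setminus\{0\}$. $\widetilde{T}$ is an m-layer ABBA network if

(3.7)    $\widetilde{T} = (\widetilde{W}_{m+1}\,\cdot + \widetilde{b}_{m+1})\,\widetilde{T}_m \cdots \widetilde{T}_1 \widetilde{W}_0$

with $\widetilde{W}_0 \in \mathbb{R}^{(2N_0)\times N_0}$, $\widetilde{W}_{m+1} \in \mathbb{R}^{N_m\times(2N_m)}$, $\widetilde{b}_{m+1} \in \mathbb{R}^{N_m}$, and

(3.8)    $(\forall i \in \{1,\ldots,m\})\quad \widetilde{T}_i = \widetilde{R}_i(\widetilde{W}_i \cdot + \widetilde{b}_i),$
(3.9)    $\widetilde{R}_i\colon \mathbb{R}^{2N_i} \to \mathbb{R}^{2N_i},$
(3.10)   $\widetilde{b}_i \in \mathbb{R}^{2N_i},$
(3.11)   $\widetilde{W}_i \in \mathcal{A}_{N_{i-1},N_i},$

for given positive integers $(N_i)_{0\le i\le m}$. $\widetilde{T}$ is an m-layer non-negative ABBA network if it is an m-layer
ABBA network as defined above and, for every $i \in \{1,\ldots,m\}$, the elements of $\widetilde{W}_i$ are non-negative.


In the remainder of this paper, $\mathcal{N}_{m,\mathcal{A}}$ will designate the class of m-layer ABBA networks and $\mathcal{N}^{+}_{m,\mathcal{A}}$
will designate the subclass of m-layer non-negative ABBA networks. This latter subclass will be the
main topic of investigation in this work. We will also use the notation $\mathcal{N}^{+}_{m,\mathcal{A}}(\rho)$ to designate the set of
neural networks in $\mathcal{N}^{+}_{m,\mathcal{A}}$ where all the activation operators operate componentwise using the same
function $\rho\colon \mathbb{R} \to \mathbb{R}$.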


3.4. Link with standard neural networks. In this section, we show that we can reshape Model 3.1
as a special case of a non-negative ABBA network. At each layer $i \in \{1,\ldots,m\}$ of this model, let
$W_i^+ = (W^+_{i,k,\ell})_{1\le k\le N_i,\, 1\le \ell\le N_{i-1}} \in [0,+\infty[^{N_i\times N_{i-1}}$ be the positive part of matrix $W_i = (W_{i,k,\ell})_{1\le k\le N_i,\, 1\le \ell\le N_{i-1}}$,
i.e.

(3.12)   $(\forall k \in \{1,\ldots,N_i\})(\forall \ell \in \{1,\ldots,N_{i-1}\})\quad W^+_{i,k,\ell} = \begin{cases} W_{i,k,\ell} & \text{if } W_{i,k,\ell} > 0 \\ 0 & \text{otherwise.} \end{cases}$

Let $W_i^- = W_i^+ - W_i \in [0,+\infty[^{N_i\times N_{i-1}}$ be the negative part of $W_i$, where all the positive elements
of $W_i$ have been discarded. Let us now define a non-negative ABBA neural network by using these
quantities.


Definition 3.5. Let $m \in \mathbb{N}\setminus\{0\}$. Let $T$ be the feedforward neural network defined in Model 3.1. $\widetilde{T}$ is a network
in $\mathcal{N}^{+}_{m,\mathcal{A}}$ associated with $T$ if it satisfies relations (3.7)-(3.11) with

(3.13)   $\widetilde{W}_0 = \begin{bmatrix} I_{N_0} \\ -I_{N_0} \end{bmatrix}, \qquad \widetilde{W}_{m+1} = \dfrac{1}{2}\begin{bmatrix} I_{N_m} & -I_{N_m} \end{bmatrix},$

and

(3.14)   $(\forall i \in \{1,\ldots,m\})(\forall x \in \mathbb{R}^{N_i})(\forall z \in \mathbb{R}^{N_i})\quad \widetilde{R}_i\left(\begin{bmatrix} x \\ z \end{bmatrix}\right) = \begin{bmatrix} R_i(x) \\ R_i(z) \end{bmatrix},$

(3.15)   $\widetilde{W}_i = \begin{bmatrix} W_i^+ & W_i^- \\ W_i^- & W_i^+ \end{bmatrix}.$


Figure 1: Equivalence between a standard fully-connected layer and its ABBA correspondent.


Note that a weight parametrization similar to (3.15) was used in [56] for computing lower and upper
bounds on the output of a deep equilibrium layer, but in this article $W_i^-$ has negative values.

As we will show next, the main result is that, if the activation functions are symmetric, network $\widetilde{T}$
defined above is identical to network $T$ in terms of input-output relation, for judicious choices of the
biases of $\widetilde{T}$.


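Proposition 3.6. Let $T$ be the m-layer feedforward network in Model 3.1. Assume that, for every
$i \in \{1,\ldots,m\}$, the activation operator $R_i$ in the $i$-th layer of $T$ satisfies the symmetry relation (3.2)
where $c_i \in \mathbb{R}^{N_i}$ and $d_i \in \mathbb{R}^{N_i}$. Let $\widetilde{T}$ be the neural network of $\mathcal{N}^{+}_{m,\mathcal{A}}$ associated with $T$ whose bias vectors
$(\widetilde{b}_i)_{1\le i\le m+1}$ are linked to those $(b_i)_{1\le i\le m}$ of $T$ by the relations

(3.16)   $(\forall i \in \{1,\ldots,m\})\quad \widetilde{b}_i = \begin{bmatrix} b_i - W_i^- d_{i-1} \\ c_i - b_i - W_i^+ d_{i-1} \end{bmatrix},$
(3.17)   $\widetilde{b}_{m+1} = \dfrac{d_m}{2},$

with $d_0 = 0$. Then, for every input, $\widetilde{T}$ delivers the same output as $T$.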
The proof of this proposition is provided in Appendix A. An illustration of the link between
fully-connected layers and ABBA matrices is shown in Figure 1. 
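
The equivalence stated in Proposition 3.6 can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the logistic sigmoid (which satisfies the symmetry relation with c = 0 and d = 1 in every component), and its ABBA biases are obtained by propagating the stacked vector [x; d - x] through each layer, which is a derivation made for this sketch rather than a quotation of the authors' code. The function names `standard_forward` and `abba_forward` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def standard_forward(x, weights, biases, act):
    """Signed network of Model 3.1: repeated x -> act(W x + b)."""
    for W, b in zip(weights, biases):
        x = act(W @ x + b)
    return x

def abba_forward(x, weights, biases, act, c, d):
    """Non-negative ABBA counterpart. c and d are scalar symmetry constants
    shared by all components and layers: act(t) - d = -act(-t + c)."""
    z = np.concatenate([x, -x])              # first layer [I; -I]
    d_prev = 0.0                             # d_0 = 0
    for W, b in zip(weights, biases):
        Wp = np.maximum(W, 0.0)              # positive part of W
        Wm = Wp - W                          # negative part (non-negative entries)
        W_abba = np.block([[Wp, Wm], [Wm, Wp]])
        shift = d_prev * np.ones(W.shape[1])
        b_abba = np.concatenate([b - Wm @ shift, c - b - Wp @ shift])
        z = act(W_abba @ z + b_abba)         # state is now [x_i; d - x_i]
        d_prev = d
    n = z.shape[0] // 2
    return 0.5 * (z[:n] - z[n:]) + 0.5 * d   # last layer (1/2)[I, -I], bias d/2

rng = np.random.default_rng(1)
sizes = [5, 7, 6, 3]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
biases = [rng.standard_normal(sizes[i + 1]) for i in range(3)]
x = rng.standard_normal(sizes[0])

y_signed = standard_forward(x, weights, biases, sigmoid)
y_abba = abba_forward(x, weights, biases, sigmoid, c=0.0, d=1.0)
```

All hidden weight matrices of the second network have non-negative entries, and only the first and last linear maps carry signs, which mirrors the structure described in this section.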


3.5. Expressivity of non-negative ABBA networks. One of the main advantages of non-negative 
ABBA networks with respect to standard networks with non-negative weights is that they are universal 
approximators. More specifically, we have the following result. 


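Proposition 3.7. Let $(n_e, n_r) \in (\mathbb{N}\setminus\{0\})^2$. Let $f\colon \mathbb{R}^{n_e} \to \mathbb{R}^{n_r}$ be a continuous function. Let $\mathbb{K}$ be any
nonempty compact subset of $\mathbb{R}^{n_e}$ and let $\epsilon \in \,]0, +\infty[$.
(i). Let $\rho\colon \mathbb{R} \to \mathbb{R}$ be a symmetric non-polynomial activation function. There exists a network
     $\widetilde{T} \in \mathcal{N}^{+}_{1,\mathcal{A}}(\rho)$ with $N_0 = n_e$ inputs and $N_2 = n_r$ outputs such that

(3.18)   $(\forall x \in \mathbb{K})\quad \|\widetilde{T}(x) - f(x)\| < \epsilon.$

(ii). Let $\rho\colon \mathbb{R} \to \mathbb{R}$ be a symmetric continuous activation function that is continuously differentiable
      around at least one point where its derivative is nonzero. Then there exists $m > 3$ and
      $\widetilde{T} \in \mathcal{N}^{+}_{m,\mathcal{A}}(\rho)$ with $N_0 = n_e$ inputs, $N_{m+1} = n_r$ outputs, and $2N_i = 2(n_e + n_r + 2)$ neurons in
      every layer $i \in \{1,\ldots,m\}$ such that (3.18) holds.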


Proof. Proposition 3.6 shows that non-negative ABBA networks can be as expressive as signed
networks. Combining this fact with existing universal approximation results for signed networks (see
[28] for (i) and [22] for (ii)) allows us to deduce these results. □


(i) addresses the case of shallow wide networks where the number of neurons in the hidden layer 
can be arbitrarily large, while (ii) corresponds to the case of deep networks having a limited number 
of neurons per layer. An illustration of these results is provided in Appendix SM7. 


3.6. Lipschitz bounds for ABBA fully-connected networks. As mentioned in the previous sections, 
the robustness of neural networks with respect to adversarial perturbations can be evaluated through 
their Lipschitz constant. However, most of the existing techniques for computing a tight estimate 
of the constant have a high computational complexity for deep or wide networks, whereas simpler 
upper bounds may turn out to be over-pessimistic. 

Nevertheless, in the context of non-negative weighted neural networks [10] proved that tight 
approximations to the Lipschitz constant can be achieved. In the following, we extend this result and 
show that we can derive a simple expression for the Lipschitz constant, using a separable bound, for 
non-negative ABBA networks. 


Proposition 3.8. Let $m \in \mathbb{N}\setminus\{0\}$ and let $\widetilde{T} \in \mathcal{N}^{+}_{m,\mathcal{A}}$ be given by (3.7)-(3.11). Assume that, for every
$i \in \{1,\ldots,m\}$, $\widetilde{R}_i$ is a nonexpansive operator operating componentwise. A Lipschitz constant of $\widetilde{T}$ is

(3.19)   $\vartheta_m = \|\widetilde{W}_{m+1}\|\, \|\mathfrak{S}(\widetilde{W}_m) \cdots \mathfrak{S}(\widetilde{W}_1)\|\, \|\widetilde{W}_0\|.$


The proof of this result is detailed in Appendix B. Note that this bound expression could be easily
extended to other norms based on the results in [10].

A standard separable upper bound for the Lipschitz constant [49] for the ABBA network $\widetilde{T}$
considered in the previous proposition is

(3.20)   $\widehat{\vartheta}_m = \|\widetilde{W}_{m+1}\|\, \|\widetilde{W}_m\| \cdots \|\widetilde{W}_1\|\, \|\widetilde{W}_0\|.$

According to Proposition 3.3(vii), this bound reads also

(3.21)   $\widehat{\vartheta}_m = \|\widetilde{W}_{m+1}\|\, \|\mathfrak{S}(\widetilde{W}_m)\| \cdots \|\mathfrak{S}(\widetilde{W}_1)\|\, \|\widetilde{W}_0\|,$

which, by simple norm inequalities, is looser than $\vartheta_m$.
If $T$ is the feedforward network defined in Model 3.1 and we apply Proposition 3.8 to the
associated non-negative ABBA network $\widetilde{T}$ of Definition 3.5, we have

(3.22)   $\|\widetilde{W}_0\| = \|\widetilde{W}_0^\top \widetilde{W}_0\|^{1/2} = \|2\, I_{N_0}\|^{1/2} = \sqrt{2}$

and

(3.23)   $\|\widetilde{W}_{m+1}\| = \|\widetilde{W}_{m+1} \widetilde{W}_{m+1}^\top\|^{1/2} = \dfrac{1}{\sqrt{2}}.$

In turn, for every $i \in \{1,\ldots,m\}$,

(3.24)   $\mathfrak{S}(\widetilde{W}_i) = W_i^+ + W_i^- = |W_i|,$

where $|W_i|$ is the matrix whose elements are the absolute values of those of $W_i$. Hence the Lipschitz constant
of $\widetilde{T}$ in (3.19) reduces to

(3.25)   $\vartheta_m = \|\, |W_m| \cdots |W_1| \,\|.$

It then follows from Proposition 3.6 that $\vartheta_m$ is also a Lipschitz constant of $T$ when using symmetric
activation functions. Note that this bound was actually already derived in [10, Proposition 5.12].
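
To make the comparison between the bound of Proposition 3.8 and its separable counterpart concrete, here is a minimal NumPy sketch for a signed network with tanh activations (nonexpansive and symmetric with c = d = 0). It evaluates the norm of the product of the entrywise absolute values of the weight matrices, the product of the per-layer norms, and an empirical ratio of output to input distances. The function names, layer sizes, and weight scaling are assumptions made for this illustration.

```python
import numpy as np
from functools import reduce

def abs_product_bound(weights):
    """Norm of |W_m| ... |W_1|, cf. (3.25); weights are listed from layer 1 to m."""
    prod = reduce(lambda acc, W: np.abs(W) @ acc, weights[1:], np.abs(weights[0]))
    return np.linalg.norm(prod, 2)

def separable_bound(weights):
    """Product of the per-layer norms of |W_i|, cf. (3.21) for the associated ABBA network."""
    return float(np.prod([np.linalg.norm(np.abs(W), 2) for W in weights]))

def forward(x, weights):
    """Bias-free tanh network; biases do not affect the Lipschitz constant."""
    for W in weights:
        x = np.tanh(W @ x)
    return x

rng = np.random.default_rng(2)
sizes = [8, 16, 16, 4]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) / 4 for i in range(3)]

theta = abs_product_bound(weights)
theta_sep = separable_bound(weights)

# Empirical lower estimate of the Lipschitz constant from random input pairs.
ratios = []
for _ in range(200):
    x, y = rng.standard_normal(8), rng.standard_normal(8)
    num = np.linalg.norm(forward(x, weights) - forward(y, weights))
    ratios.append(num / np.linalg.norm(x - y))
max_ratio = max(ratios)
```

The empirical ratio never exceeds the first bound, which in turn never exceeds the separable one; how large the gap is depends on the weights.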


4. Convolutional networks. We will now extend the results presented in Section 3 to convolutional 
layers. 


4.1. ABBA convolutional layers. For any $i \in \{1,\ldots,m\}$, $W_i$ is a convolutional layer with
$\zeta_{i-1} \in \mathbb{N}\setminus\{0\}$ input channels, $\zeta_i$ output channels, kernels $(w_{i,q,p})_{1\le p\le \zeta_{i-1},\, 1\le q\le \zeta_i}$, and stride $s_i \in \mathbb{N}\setminus\{0\}$.
The output $(y_q)_{1\le q\le \zeta_i}$ of this layer (prior to applying any activation operation) is linked to its input
$(x_p)_{1\le p\le \zeta_{i-1}}$ by

(4.1)    $(\forall q \in \{1,\ldots,\zeta_i\})\quad u_q = \sum_{p=1}^{\zeta_{i-1}} w_{i,q,p} * x_p, \qquad y_q = (u_q)_{\downarrow s_i}.$

Hereabove, for every $p \in \{1,\ldots,\zeta_{i-1}\}$, $x_p = (x_p(n))_{n\in\mathbb{Z}^d}$ designates a d-dimensional discrete signal.
Dimension $d = 1$ corresponds to 1D signals and $d = 2$ to images. A similar notation is used for other
signals, in particular $u_q$ and $w_{i,q,p}$ with $q \in \{1,\ldots,\zeta_i\}$. The d-dimensional discrete convolution is
denoted by $*$ and $(\cdot)_{\downarrow s_i}$ is the decimation (or subsampling) by a factor $s_i$.

The ABBA convolutional layer $\widetilde{W}_i$ associated with $W_i$ has twice the number of input channels
and twice the number of output ones. More specifically, its input consists of $\zeta_{i-1}$ signals $(x_p^+)_{1\le p\le \zeta_{i-1}}$
and $\zeta_{i-1}$ signals $(x_p^-)_{1\le p\le \zeta_{i-1}}$. Similarly, its output consists of $\zeta_i$ signals $(y_q^+)_{1\le q\le \zeta_i}$ and $\zeta_i$ signals
$(y_q^-)_{1\le q\le \zeta_i}$. To make the input-output relations more explicit, let us define the kernels $w^+_{i,q,p}$ and $w^-_{i,q,p}$
analogously to the fully connected case:

(4.2)    $(\forall n \in \mathbb{Z}^d)\quad w^+_{i,q,p}(n) = \begin{cases} w_{i,q,p}(n) & \text{if } w_{i,q,p}(n) > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad w^-_{i,q,p}(n) = w^+_{i,q,p}(n) - w_{i,q,p}(n).$

Then the outputs of the ABBA layer are linked to its inputs by the following relations

(4.3)    $(\forall q \in \{1,\ldots,\zeta_i\})\quad \begin{cases} u_q^+ = \sum_{p=1}^{\zeta_{i-1}} w^+_{i,q,p} * x_p^+ + \sum_{p=1}^{\zeta_{i-1}} w^-_{i,q,p} * x_p^- \\ u_q^- = \sum_{p=1}^{\zeta_{i-1}} w^-_{i,q,p} * x_p^+ + \sum_{p=1}^{\zeta_{i-1}} w^+_{i,q,p} * x_p^- \\ y_q^+ = (u_q^+)_{\downarrow s_i} \\ y_q^- = (u_q^-)_{\downarrow s_i}. \end{cases}$


The above equations provide the general form of a convolutional ABBA layer when relaxing (4.2). 
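
The relations above can be illustrated in the 1D case (d = 1). The following sketch is a hedged illustration, not the authors' implementation: `conv_layer` and `abba_conv_layer` are hypothetical helper names, the convolution is the full (zero-padded) one computed with `np.convolve`, and decimation keeps every s-th sample. Feeding the ABBA layer with the pair (x, -x) reproduces the signed layer output y on the first half of the channels and -y on the second half.

```python
import numpy as np

def conv_layer(x, w, stride):
    """Signed multichannel 1D layer: x has shape (c_in, L), w has shape
    (c_out, c_in, k). Full convolution per channel pair, then decimation."""
    c_out, c_in, _ = w.shape
    u = np.stack([
        sum(np.convolve(w[q, p], x[p]) for p in range(c_in))
        for q in range(c_out)
    ])
    return u[:, ::stride]

def abba_conv_layer(x_plus, x_minus, w, stride):
    """ABBA layer built from the positive and negative parts of the kernels w."""
    w_plus = np.maximum(w, 0.0)
    w_minus = w_plus - w                     # non-negative by construction
    y_plus = conv_layer(x_plus, w_plus, stride) + conv_layer(x_minus, w_minus, stride)
    y_minus = conv_layer(x_plus, w_minus, stride) + conv_layer(x_minus, w_plus, stride)
    return y_plus, y_minus

rng = np.random.default_rng(3)
x = rng.standard_normal((3, 20))             # 3 input channels
w = rng.standard_normal((2, 3, 5))           # 2 output channels, kernel length 5

y = conv_layer(x, w, stride=2)
y_plus, y_minus = abba_conv_layer(x, -x, w, stride=2)
```

Only non-negative kernels are used inside `abba_conv_layer`; the sign information is carried entirely by the duplicated input channels.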

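An alternative formulation of convolutional layers in a matrix form, along with its corresponding
d-dimensional spectral representation, is possible (see Appendix SM3). This basically amounts to
characterizing layer (4.1) by the following matrices

(4.4)    $(\forall n \in \mathbb{Z}^d)\quad W_i(n) = \begin{bmatrix} w_{i,1,1}(n) & \cdots & w_{i,1,\zeta_{i-1}}(n) \\ \vdots & & \vdots \\ w_{i,\zeta_i,1}(n) & \cdots & w_{i,\zeta_i,\zeta_{i-1}}(n) \end{bmatrix} \in \mathbb{R}^{\zeta_i\times\zeta_{i-1}},$

defining the so-called MIMO impulse response of $W_i$, which plays a prominent role in dynamical
system theory [52]. The MIMO impulse response of the ABBA layer $\widetilde{W}_i$ is then characterized by ABBA
matrices:

(4.5)    $(\forall n \in \mathbb{Z}^d)\quad \widetilde{W}_i(n) = \begin{bmatrix} W_i^+(n) & W_i^-(n) \\ W_i^-(n) & W_i^+(n) \end{bmatrix} \in [0,+\infty[^{(2\zeta_i)\times(2\zeta_{i-1})},$

where $W_i^+(n) = (w^+_{i,q,p}(n))_{1\le q\le \zeta_i,\, 1\le p\le \zeta_{i-1}} \in [0,+\infty[^{\zeta_i\times\zeta_{i-1}}$ and
$W_i^-(n) = (w^-_{i,q,p}(n))_{1\le q\le \zeta_i,\, 1\le p\le \zeta_{i-1}} \in [0,+\infty[^{\zeta_i\times\zeta_{i-1}}$. This alternative view will be useful in the following
sections.
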

4.2. Lipschitz bounds for convolutional networks. In this section, we establish bounds on
the Lipschitz constant of an m-layer convolutional neural network $T$. Each linear operator $W_i$
corresponding to layer $i \in \{1,\ldots,m\}$ will be defined by (4.1). We also define a variable

(4.6)    $\sigma_i = \prod_{l=1}^{i} s_l$

aggregating strides from layer 1 to layer $i$. Subsequently, we will assume that
the activation operators $(R_i)_{1\le i\le m}$ are nonexpansive operators. Moreover, these operators are applied
componentwise (see Appendix SM1). This means that, for every $i \in \{1,\ldots,m-1\}$, there exists a
function $\rho_i$ from $\mathbb{R}$ to $\mathbb{R}$ such that

(4.7)    $(\forall x \in \mathcal{H}_i)\quad y = R_i(x) \;\Leftrightarrow\; (\forall p \in \{1,\ldots,\zeta_i\})(\forall n \in \mathbb{Z}^d)\;\; y_p(n) = \rho_i(x_p(n)).$


In Appendix SM4, we derive frequency-based expressions allowing us to calculate bounds on
the Lipschitz constant of $T$. For accurate numerical evaluations, the frequency transform in these
expressions has to be replaced by a Discrete Fourier Transform involving a significant number of
frequency bins (e.g., $128^d$). Due to this fact, a computation bottleneck occurs when MIMO filters
are characterized by a large number of input/output channels (e.g., for 2D applications). In the
following, we provide an alternative lower-complexity formulation for computing bounds of the
Lipschitz constant. In the case of non-negative kernels, we show that this bound is tight.


Theorem 4.1. Let $(\sigma_i)_{1\le i\le m}$ be the aggregated stride factors of network $T$, as defined by (4.6), and
let

(4.8)    $W = (W_m)_{\uparrow\sigma_{m-1}} * \cdots * (W_2)_{\uparrow\sigma_1} * W_1$

where $(W_i)_{1\le i\le m}$ are the MIMO impulse responses of each layer of network $T$ and, for every $i \in
\{2,\ldots,m\}$, $(W_i)_{\uparrow\sigma_{i-1}}$ is the interpolated sequence by a factor $\sigma_{i-1}$ of $W_i$ (see (SM3.9)). For every
$j \in \mathbb{S}(\sigma_m) = \{0,\ldots,\sigma_m - 1\}^d$, we define the following matrix:

(4.9)    $W^{(j)} = \sum_{n\in\mathbb{Z}^d} W(\sigma_m n + j) \in [0,+\infty[^{\zeta_m\times\zeta_0}.$

Then

(4.10)   $\theta_m = \Big\| \sum_{j\in\mathbb{S}(\sigma_m)} W^{(j)} (W^{(j)})^\top \Big\|^{1/2}$

is a lower bound on the Lipschitz constant estimate of network $T$. In addition, if for every $i \in \{1,\ldots,m\}$,
$p \in \{1,\ldots,\zeta_{i-1}\}$ and $q \in \{1,\ldots,\zeta_i\}$, $w_{i,q,p} = (w_{i,q,p}(n))_{n\in\mathbb{Z}^d}$ is a non-negative kernel, then $\theta_m$ is a
Lipschitz constant of $T$.


The proof of Theorem 4.1 is given in Appendix C. 
The constant $\theta_m$ in (D.4) is actually equal to the one calculated in Appendix SM4. The following
majorization is thus obtained (see (SM4.4)):

(4.11)   $\theta_m \le \widehat{\theta}_m = \|W_m\| \cdots \|W_1\|.$

By applying Theorem 4.1 to each individual layer $(W_i)_{1\le i\le m}$ assumed to be with non-negative kernels,
we get the following expression for the upper bound:

(4.12)   $\widehat{\theta}_m = \prod_{i=1}^{m} \Big\| \sum_{j\in\mathbb{S}(s_i)} W_i^{(j)} (W_i^{(j)})^\top \Big\|^{1/2},$

where

(4.13)   $(\forall i \in \{1,\ldots,m\})(\forall j \in \mathbb{S}(s_i))\quad W_i^{(j)} = \sum_{n\in\mathbb{Z}^d} W_i(s_i n + j).$

The bound $\widehat{\theta}_m$ is generally more tractable than $\theta_m$ since it separates the influence of each layer and
does not require computing the global matrix sequence $W$ as expressed by (4.8). However, such
separable bounds are usually loose. According to our observations, it turns out that, in the special
case of convolutional layers with non-negative kernels, $\theta_m$ and $\widehat{\theta}_m$ are quite close (see numerical tests
in Appendix SM5).

To illustrate these results, the computation of the Lipschitz bound of a layer corresponding to an 
average pooling is presented as an example in Appendix SM6. 
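
For a single 1D layer with non-negative kernels, the per-layer quantity appearing in (4.12)-(4.13) only requires summing kernel taps over each stride phase. The sketch below is a minimal illustration under that assumption (d = 1, one layer); the function name `layer_bound` and the array layout `w[q, p, n]` are choices made for this example and are not taken from the paper or its code.

```python
import numpy as np

def layer_bound(w, stride):
    """w has shape (c_out, c_in, k), so that w[:, :, n] is the MIMO tap W(n).
    Returns the norm of sum_j W^(j) (W^(j))^T raised to the power 1/2,
    where W^(j) sums the taps W(stride * n + j) over n."""
    acc = np.zeros((w.shape[0], w.shape[0]))
    for j in range(stride):
        Wj = w[:, :, j::stride].sum(axis=2)
        acc += Wj @ Wj.T
    return np.sqrt(np.linalg.norm(acc, 2))

rng = np.random.default_rng(4)
w = rng.random((3, 2, 5))                    # non-negative kernels

# Stride 1: compare with the largest spectral norm of the frequency response
# evaluated on a 64-point DFT grid.
W_hat = np.fft.fft(w, n=64, axis=2)
freq_norm = max(np.linalg.norm(W_hat[:, :, f], 2) for f in range(64))
bound_stride1 = layer_bound(w, stride=1)

# 1D average pooling of size 2 with stride 2: one channel, kernel [1/2, 1/2].
pool_bound = layer_bound(np.full((1, 1, 2), 0.5), stride=2)
```

For unit stride and non-negative kernels, the frequency-domain maximum is attained at the zero frequency, so both computations agree while the second one avoids the DFT entirely. For the 1D pooling kernel the value is 1/sqrt(2).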


4.3. Bounds for ABBA convolutional networks. Let us extend the previous results to the ABBA
context. The linear operators of the considered ABBA network $\widetilde{T}$ are denoted by $(\widetilde{W}_i)_{0\le i\le m+1}$. The
weights in $\widetilde{W}_0$ and $\widetilde{W}_{m+1}$ are signed, whereas $(\widetilde{W}_i)_{1\le i\le m}$ are convolutional layers with d-dimensional
non-negative kernels. More precisely, we assume that, for every $i \in \{1,\ldots,m\}$, the $i$-th layer of
the ABBA network has $2\zeta_{i-1}$ input channels, $2\zeta_i$ output channels, and stride $s_i \in \mathbb{N}\setminus\{0\}$. The
MIMO impulse response of such a layer is of the form (4.5). We make the same assumptions of
nonexpansiveness and separability for the activation operators as in the previous section. We recall that
$(W)_{\uparrow\sigma}$ denotes the interpolated version by a factor $\sigma$ of a MIMO impulse response $W = (W(n))_{n\in\mathbb{Z}^d}$.

The following result is then established in Appendix D:


Theorem 4.2. Under the above assumptions on the convolutional ABBA network $\widetilde{T}$, let

(4.14)   $(\forall i \in \{1,\ldots,m\})(\forall j \in \mathbb{S}(s_i))\quad \Omega_i^{(j)} = \sum_{n\in\mathbb{Z}^d} \mathfrak{S}(\widetilde{W}_i(s_i n + j)) \in [0,+\infty[^{\zeta_i\times\zeta_{i-1}},$

where $(\widetilde{W}_i(n))_{n\in\mathbb{Z}^d}$ is the MIMO impulse response of the ABBA layer of index $i$. Then a Lipschitz constant
of $\widetilde{T}$ is

(4.15)   $\widetilde{\theta}_m = \|\widetilde{W}_{m+1}\| \left( \prod_{i=1}^{m} \Big\| \sum_{j\in\mathbb{S}(s_i)} \Omega_i^{(j)} (\Omega_i^{(j)})^\top \Big\|^{1/2} \right) \|\widetilde{W}_0\|,$

where $\|\widetilde{W}_{m+1}\|$ (resp. $\|\widetilde{W}_0\|$) is the spectral norm of the linear operator employed in the last (resp. first)
layer.

The bound (4.15) will be subsequently used to control the Lipschitz constant of non-negative ABBA 
networks during their training. 


5. Lipschitz-constrained training. The theoretical bounds established in the previous sections 
provide a relatively easy way of computing a tight estimate of the global Lipschitz constant. We 
propose a simple approach to control it during the training phase. Since our networks contain mostly 
layers having non-negative weights and a few layers having arbitrary-signed weights, their Lipschitz 
constant will be controlled separately and different constraint sets will be handled for each case. 

To train a robust ABBA network, we employ a projected version of the well-known ADAM
optimizer. Each layer $i$ is parameterized by a vector $\Psi_i$. In the case of a dense layer, $\Psi_i$ is a vector
gathering the elements of weight matrix $W_i$, the components of the associated bias $b_i$, and a possible
additional parameter that will be introduced hereafter. For an ABBA layer, $\Psi_i$ is thus a vector of
dimension $2N_i(N_{i-1} + 1)$ or $2N_i(N_{i-1} + 1) + 1$. In the context of a 2D convolutional layer, an array
$w_i$ of scalar convolutional kernels is substituted for the weight matrix. In the ABBA case, we have
$2\zeta_i\zeta_{i-1}$ such kernels. To ensure nonnegativity (if needed) and Lipschitz bound conditions on the
weight operator, we project $\Psi_i$ onto a suitable closed and convex constraint set. Considering pairs
$(z_k)_{1\le k\le K}$ of input images and their associated labels, the operations performed at each epoch $n > 0$
to minimize a loss function $\mathcal{L}$ are presented in Algorithm 5.1. After each iteration $t$ of the optimizer, we
perform a projection $\mathrm{proj}_{S_{i,t}}$ onto a constraint set $S_{i,t}$. The definition of this set and the corresponding
method for managing the projection are detailed in the following, according to the network type.

Handling Lipschitz constants for fully-connected layers. Consider the network defined by 
Model (3.5). In the case of fully connected networks, the Lipschitz constant is given by Proposition 3.8, 
which basically splits the bound into three terms: the first and the last account for the starting and 
ending layers, respectively, while the middle one encompasses all the ABBA layers. For the first and
last layers, which have arbitrary-signed weights, we control the Lipschitz constants individually during
training, by imposing a bound on the spectral norm of each weight matrix. This defines the following
two constraints:

(5.1)  $(\forall i \in \{0,m+1\}) \quad \|W_i\| \le \theta_{m,i},$

where $\theta_{m,i}$ is the imposed Lipschitz bound for the $i$-th layer. To deal with this constraint, we
decompose the weight matrix as $W_i = \theta_{m,i} W_i'$, which yields the constraint set

(5.2)  $(\forall i \in \{0,m+1\}) \quad \mathcal{S}_{i,t} = \{W_i' \mid \|W_i'\| \le 1\}.$

The projection onto $\mathcal{S}_{i,t}$ is performed by clipping the singular values of $W_i'$ to 1.

In our proposed training procedure, we set $\theta_{m,0}\,\theta_{m,m+1} = 1$. This gives the network one degree of
freedom to automatically adapt the value of the Lipschitz constant of these two layers. To do so, we
adopt the following parametrization

(5.3)  $\theta_{m,0} = \exp(\alpha), \quad \theta_{m,m+1} = \exp(-\alpha),$

where $\alpha \in \mathbb{R}$ is a trainable parameter. It constitutes an extra component of the vector $\Psi_i$ when
$i \in \{0,m+1\}$.
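
The handling of the first and last layers described above can be sketched numerically. The snippet below is only an illustration under stated assumptions: it uses NumPy, random matrices of arbitrary sizes, an arbitrary value of the trainable scalar, and a hypothetical helper name `project_spectral_ball`. It clips singular values as in (5.2) and applies the parametrization (5.3).

```python
import numpy as np

def project_spectral_ball(w, radius=1.0):
    """Project w onto {W : ||W|| <= radius} by clipping its singular values."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u * np.minimum(s, radius)) @ vt

rng = np.random.default_rng(0)
alpha = 0.3  # trainable scalar of (5.3); the value is arbitrary here
theta_first, theta_last = np.exp(alpha), np.exp(-alpha)

# Normalized factors W_0' and W_{m+1}' (sizes are hypothetical).
w_first_prime = project_spectral_ball(rng.normal(size=(8, 4)))
w_last_prime = project_spectral_ball(rng.normal(size=(3, 8)))

# Actual weights W_i = theta_{m,i} W_i'.
w_first = theta_first * w_first_prime
w_last = theta_last * w_last_prime

norm_first = np.linalg.norm(w_first, 2)
norm_last = np.linalg.norm(w_last, 2)
print(norm_first <= theta_first + 1e-9, norm_first * norm_last <= 1 + 1e-9)
```

With this parametrization, the product of the two imposed bounds stays equal to 1 whatever the value of the scalar, so only the ABBA part has to carry the target bound.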

In the case of ABBA dense layers, we need to handle two requirements: to ensure that, for every
$i \in \{1,\ldots,m\}$, $W_i$ is a non-negative ABBA matrix, and to constrain the product of all the weight
matrices so that $\|W_m \cdots W_1\| \le \theta_m$. Since $\theta_{m,0}\,\theta_{m,m+1} = 1$, $\theta_m$ corresponds to the target
Lipschitz bound for the ABBA network.

For every $i \in \{1,\ldots,m\}$, $W_i$ is parameterized by $W_i^+$ and $W_i^-$. We define the following two
constraint sets:

(5.4)  $\mathcal{D}_i = \{(W_i^+, W_i^-) \in (\mathbb{R}^{N_i \times N_{i-1}})^2 \mid W_i^+ \ge 0 \text{ and } W_i^- \ge 0\},$

(5.5)  $\mathcal{C}_{i,t} = \Big\{ (W_i^+, W_i^-) \in (\mathbb{R}^{N_i \times N_{i-1}})^2 \;\Big|\; \Big\| A_{i,t} \begin{bmatrix} W_i^+ & W_i^- \\ W_i^- & W_i^+ \end{bmatrix} B_{i,t} \Big\| \le \theta_m \Big\}.$

Here above, the matrix $A_{i,t}$ (resp. $B_{i,t}$) is an ABBA matrix, which is the product of the weight matrices
of the subsequent (resp. previous) layers. In this case, $\mathcal{S}_{i,t} = \mathcal{D}_i \cap \mathcal{C}_{i,t}$. To perform the projection onto
the intersection of these two sets, we use an instance of the proximal algorithm presented in [35],
which alternates between elementary projections onto $\mathcal{D}_i$ and projections onto the spectral ball with
center 0 and radius $\theta_m$. Because of Proposition 3.3 (xii), the latter projection allows us to keep the
structure of ABBA matrices.
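
The structure-preservation property invoked above (Proposition 3.3 (xii)) can be checked numerically. The following is a minimal sketch, assuming NumPy and random non-negative blocks of arbitrary size; the helper names `abba` and `project_spectral_ball` are hypothetical, and the snippet does not implement the proximal algorithm of [35]. It only verifies that projecting an ABBA matrix onto a spectral ball returns a matrix with the same block pattern.

```python
import numpy as np

def abba(a, b):
    """Build the ABBA matrix [[A, B], [B, A]]."""
    return np.block([[a, b], [b, a]])

def project_spectral_ball(m, radius):
    """Projection onto {M : ||M|| <= radius}: clip the singular values."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u * np.minimum(s, radius)) @ vt

rng = np.random.default_rng(1)
a, b = rng.uniform(size=(3, 4)), rng.uniform(size=(3, 4))
p = project_spectral_ball(abba(a, b), radius=1.0)

n_out, n_in = a.shape
top_left, top_right = p[:n_out, :n_in], p[:n_out, n_in:]
bottom_left, bottom_right = p[n_out:, :n_in], p[n_out:, n_in:]

# The projected matrix still has the [[A', B'], [B', A']] pattern.
is_abba = np.allclose(top_left, bottom_right) and np.allclose(top_right, bottom_left)
print(is_abba, np.linalg.norm(p, 2))
```

The entries of the projected matrix are not guaranteed to remain non-negative, which is why the spectral projection has to be combined with the projection onto the non-negativity set.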

Handling Lipschitz constants for convolutional layers. In the case of convolutional ABBA
networks, we derived the bound in (4.15), which consists of the product of $m+2$ terms. The Lipschitz
bound constraint is managed by introducing auxiliary variables $(\theta_{m,i})_{0\le i\le m+1}$ defining upper bounds
for each layer. At iteration $t$ of the algorithm, estimates $(\theta_{m,i,t})_{0\le i\le m+1}$ of the auxiliary bounds are
updated. Similarly to the fully connected case, we use two different types of constraints.

For the $i$-th ABBA convolutional layer with $i \in \{1,\ldots,m\}$, we consider the constraint set

(5.6)  $\mathcal{C}_{i,t} = \Big\{ W_i \;\Big|\; \Big\| \sum_{j \in \mathbb{S}(s_i)} \Omega_i^{(j)} (\Omega_i^{(j)})^\top \Big\| \le \theta_{m,i,t}^2 \Big\},$

where the matrices $(\Omega_i^{(j)})_{j \in \mathbb{S}(s_i)}$ are linked to the convolution kernels by the linear relation (4.14). By
concatenating all these $s_i^d$ matrices horizontally, we obtain a rectangular matrix $\Omega_i$, which allows us to
reexpress (5.6) in the simpler form

(5.7)  $\mathcal{C}_{i,t} = \{ W_i \mid \|\Omega_i\| \le \theta_{m,i,t} \}.$

We also have to impose the non-negativity of the filters alongside the stability bound. This corresponds
to a constraint set $\mathcal{D}_i$. Projecting onto $\mathcal{S}_{i,t} = \mathcal{C}_{i,t} \cap \mathcal{D}_i$ is performed by using the same iterative proximal
algorithm as previously.
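
The passage from the sum of Gram matrices in (5.6) to the norm of the concatenated matrix in (5.7) relies on a simple identity, which can be verified numerically. The sketch below assumes NumPy and uses four random non-negative matrices of arbitrary size as stand-ins for the matrices of (4.14); they are not derived from actual kernels.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical stand-ins for the s_i^d matrices of (4.14).
omegas = [rng.uniform(size=(5, 3)) for _ in range(4)]

# Left-hand side of (5.6): spectral norm of the sum of Gram matrices.
gram_norm = np.linalg.norm(sum(om @ om.T for om in omegas), 2)

# (5.7): squared spectral norm of the horizontal concatenation.
omega_cat = np.hstack(omegas)
cat_norm_sq = np.linalg.norm(omega_cat, 2) ** 2

print(gram_norm, cat_norm_sq)
```

The two quantities coincide because the product of the concatenated matrix with its transpose is exactly the sum of the individual Gram matrices.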

For the first and the last layers, we similarly impose that $\|\mathcal{W}_i\| \le \theta_{m,i,t}$ with $i \in \{0,m+1\}$. Since
the kernels are signed, we resort to a frequency formulation (see (SM4.18)) to estimate the spectral
norm of the convolutional operator. The procedure we use is described in Appendix SM8.

Convolutional layers are usually followed by an ABBA fully connected network. This part will
be handled as explained previously. However, we need to set the upper bounds $(\theta_{m,i,t})_{0\le i\le m+1}$ used
in the convolutional part and the upper bound of the ABBA fully connected part. With a slight abuse
of notation, let us denote this latter bound by $\theta_{m,m+2,t}$, while the target Lipschitz constant for the
global network is still denoted by $\theta_m$. We have to deal with the following constraint:

(5.8)  $\prod_{i=0}^{m+2} \theta_{m,i,t} = \theta_m.$

We proceed by computing the Lipschitz constants $(\widetilde{\theta}_{m,i,t})_{0\le i\le m+2}$ of the layers after the ADAM update.
Then, we set

(5.9)  $(\forall i \in \{0,\ldots,m+2\}) \quad \theta_{m,i,t} = \widetilde{\theta}_{m,i,t} \Big( \frac{\theta_m}{\prod_{i'=0}^{m+2} \widetilde{\theta}_{m,i',t}} \Big)^{\frac{1}{m+3}},$

which guarantees that (5.8) holds. Update (5.9) can be interpreted as the orthogonal projection onto
the constraint set defined by (5.8) after a logarithmic transform of the auxiliary variables. The benefit
of such a transform is to convexify the constraint.
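
The log-domain interpretation of the rescaling step can be made concrete with a few lines of code. This is a minimal sketch in plain Python; the function name `rescale_bounds` and the numerical values are hypothetical. In the log domain, the product constraint becomes a hyperplane, and the orthogonal projection subtracts the same offset from every coordinate.

```python
import math

def rescale_bounds(theta_tilde, theta_target):
    """Rescale positive constants so that their product equals theta_target.

    Implemented as an orthogonal projection of the log-values onto the
    hyperplane sum(log theta_i) = log(theta_target).
    """
    count = len(theta_tilde)
    log_excess = sum(math.log(v) for v in theta_tilde) - math.log(theta_target)
    return [math.exp(math.log(v) - log_excess / count) for v in theta_tilde]

# Hypothetical constants measured after an ADAM step (here m + 3 = 4 blocks).
theta_tilde = [1.8, 0.9, 1.3, 2.1]
theta = rescale_bounds(theta_tilde, theta_target=1.5)
product = math.prod(theta)
print(theta, product)
```

Every constant is multiplied by the same factor, so the ratios between layers found by the optimizer are preserved while the global product is reset to the target.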
Algorithm 5.1 Projected ADAM Algorithm
  $\Psi_{i,t}$: weights of layer $i$ at iteration $t$
  $(z_k)_{1\le k\le K}$: set of input image-label pairs
  $\beta_1, \beta_2$: ADAM hyperparameters
  Partition $\{1,\ldots,K\}$ into mini-batches $(\mathbb{M}_{q,n})_{1\le q\le Q}$
  $t = (n-1)Q + q$
  # sweep mini-batches
  for $q \in \{1,\ldots,Q\}$ do
    for layer $i$ do
      $g_{i,t} = \sum_{k \in \mathbb{M}_{q,n}} \nabla_i \ell\big(z_k, (\Psi_{i',t})_{i'}\big)$
      $\mu_{i,t} = \beta_1 \mu_{i,t-1} + (1-\beta_1) g_{i,t}$
      $\nu_{i,t} = \beta_2 \nu_{i,t-1} + (1-\beta_2) g_{i,t}^2$
      $\gamma_t = \gamma \sqrt{1-\beta_2^t}/(1-\beta_1^t)$
      $\Psi_{i,t+1} = \Psi_{i,t} - \gamma_t \mu_{i,t}/(\sqrt{\nu_{i,t}} + \epsilon)$
    end for
    for layer $i$ do
      $\Psi_{i,t+1} = \mathrm{proj}_{\mathcal{S}_{i,t}}(\Psi_{i,t+1})$
    end for
  end for

Table 1: Comparison between ABBA, full non-negative and arbitrary-signed (baseline) networks.

Dataset   Network        Architecture   Accuracy [%]
MNIST     ABBA           Dense          98.33
MNIST     ABBA           Conv           98.70
MNIST     Non-Negative   Dense          [...]
MNIST     Non-Negative   Conv           [...]
MNIST     Baseline       Dense          98.35
MNIST     Baseline       Conv           98.68
FMNIST    ABBA           Dense          90.02
FMNIST    ABBA           Conv           90.17
FMNIST    Non-Negative   Dense          84.56
FMNIST    Non-Negative   Conv           83.09
FMNIST    Baseline       Dense          90.00
FMNIST    Baseline       Conv           90.20
RPS       ABBA           Conv           99.08
RPS       Non-Negative   Conv           67.30
RPS       Baseline       Conv           98.86
CelebA    ABBA           Conv           90.21
CelebA    Non-Negative   Conv           61.04
CelebA    Baseline       Conv           90.17
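
The structure of Algorithm 5.1 (an ADAM update followed by a projection) can be illustrated on a toy problem. The sketch below is not the training code used in this work: it assumes NumPy, a single parameter vector, a deterministic quadratic loss instead of mini-batch gradients, and a plain non-negativity projection in place of the constraint sets of Section 5. The function name `projected_adam` and all numerical values are hypothetical.

```python
import numpy as np

def projected_adam(grad_fn, project, psi, steps=500, lr=0.05,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """ADAM iterations, each followed by a projection onto a constraint set."""
    mu = np.zeros_like(psi)   # first-moment estimate
    nu = np.zeros_like(psi)   # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(psi)
        mu = beta1 * mu + (1 - beta1) * g
        nu = beta2 * nu + (1 - beta2) * g**2
        lr_t = lr * np.sqrt(1 - beta2**t) / (1 - beta1**t)  # bias-corrected step size
        psi = psi - lr_t * mu / (np.sqrt(nu) + eps)
        psi = project(psi)    # projection step of the algorithm
    return psi

# Toy loss ||psi - target||^2 with a non-negativity constraint.
target = np.array([1.0, -2.0, 0.5])
psi = projected_adam(lambda p: 2.0 * (p - target),
                     lambda p: np.maximum(p, 0.0),
                     np.ones(3))
print(psi)
```

The coordinate whose unconstrained optimum is negative ends up at zero, the boundary of the constraint set, while the other coordinates approach their unconstrained optimum.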


6. Experiments. In this section, we show the versatility of ABBA neural networks in solving
classification tasks. The objective of our experiments is three-fold:

(i). First, we compare nonnegative ABBA structures with their classic non-negative counterparts
and check that our method yields significantly better results in all considered cases.
(ii). We then train ABBA models constrained to different Lipschitz bound values and evaluate
their robustness against several adversarial attacks.
(iii). Finally, we compare our proposed approach with four other well-established defense strategies,
namely Adversarial Training (AT), Trade-off-inspired adversarial defense (TRADES) [59],
Deel-Lip proposed by [45], and orthonormalization [2].


We validate our ABBA networks on four benchmark image classification datasets: MNIST, its more
complex variant Fashion MNIST⁴, a variant⁵ of the Rock-Paper-Scissors (RPS) dataset [34], and a
binary classification task on CelebA [31]. For the last dataset, inspired by [45] where eyeglass detection
is performed, we specialized our models on a different attribute, namely identifying whether a person
is bald or not. To explore the features of different ABBA network topologies, we experiment with two
main types of ABBA architectures: one having only fully-connected layers, further referred to as ABBA
Dense, and another one which includes a convolutional part for feature extraction, followed by a fully
connected classification module, referred to as ABBA Conv. Depending on the dataset, the particularities
of each architecture are slightly different. A detailed description of all the small networks employed in this
work, as well as other training details, is provided in Appendix SM9. For all experiments, the input
images were scaled to the $[-1,1]$ interval.

⁴https://github.com/zalandoresearch/fashion-mnist
⁵https://github.com/DrGFreeman/rps-cv

Figure 2: Accuracy vs. perturbation for different Lipschitz constants, Dense architecture.

ABBA networks vs. non-negative networks. First, we compare our non-negative ABBA networks
with standard ones trained under non-negativity constraints. We consider standard neural networks
having the same number of parameters as their ABBA equivalents. The results are summarized in Table
1, indicating that ABBA neural networks yield far superior results in all cases. In the case of fully
connected architectures, the difference in terms of accuracy is approximately 3% and 5% for MNIST
and FMNIST, respectively. The difference is even higher when we consider Conv architectures
(approximately 5%, 7%, 31%, and 32% for MNIST, FMNIST, RPS, and CelebA, respectively). This shows that
standard non-negative convolutional kernels are often suboptimal for extracting relevant information
from image data. On the other hand, training standard neural networks having arbitrary-signed
weights gives results very similar to their ABBA equivalents in all cases, showing that ABBA
networks do not suffer from these shortcomings. These results are in agreement with Proposition 3.6.

Stability vs. Performance. According to the “no free lunch” theorem [51], stability guarantees 
may impact the system performance on clean data. In this work, we train several models by following 
the approach described in Section 5. The Lipschitz constant of the network is varied in an effort to 
find the optimal trade-off between robustness and classification accuracy. This compromise is usually 
use-case specific, depending on the architecture complexity and on the dataset particularities, so the 
tightness of the imposed stability bound must be chosen accordingly. In our experiments, we limited
the maximum Lipschitz constant we impose, so that the drop in performance does not exceed 5% of
the baseline model accuracy (i.e., the model trained without robustness constraints).

Figure 3: Accuracy vs. perturbation for different Lipschitz constants, ABBA Conv architectures, MNIST
and FMNIST.

Adversarial attack validation. We train several robust ABBA models by varying the global
Lipschitz bound $\theta_m$. We then evaluate their robustness against inputs corrupted with different levels
of adversarial perturbations, by studying their influence on the overall performance of the system.
To create the adversarial perturbations, we use three white-box attackers, described next. DDN
[42] is a gradient-based $\ell_2$ adversarial attack method that seeks to decouple the direction and norm
of the additive perturbation. By doing so, this attack is able to generate effective examples, while
requiring fewer iterations than other methods. DeepFool [33] considers a linear approximation to the
model and refines the attack sample iteratively, by selecting the point that would cross the decision
boundary with minimal effort in the logit space. The FMN [39] attack improves the approach in
DDN by introducing adaptive norm constraints on the perturbation, in order to balance the trade-off
between the magnitude of the perturbation and the level of misclassification. This results in a
powerful attack that is able to generate adversarial examples with small perturbation levels.

For a given maximum $\ell_2$ perturbation norm, we ran each attack for 300 steps, using the default
hyperparameters for each of the three attackers. Although all images are normalized to the $[-1,1]$
range, we report the robust accuracy w.r.t. the $\ell_2$ perturbation measured in the $[0,1]$ range, which is the
common practice in the literature.

The results are summarized in Figures 2, 3, and 4, which show the robustness of the MNIST, FMNIST,
RPS, and CelebA ABBA models w.r.t. an increasing $\ell_2$-norm perturbation, generated with the DDN, FMN,
and DeepFool attacks. A baseline model, trained with arbitrary-signed weights and without stability
constraints, is provided as a reference. These graphs can be interpreted as the expected performance
of the model if the attack is allowed to influence the input image with an $\ell_2$-norm less than $\epsilon$, where
the level of perturbation $\epsilon$ varies. For a better understanding of the adversarial perturbation effect,
some visual examples of the attacked inputs, for all the datasets, are presented in Appendix SM10.

It can be observed that our robust ABBA models are significantly less affected by adversarial
inputs than the undefended baseline. This demonstrates that carefully controlling the Lipschitz
constant during training improves the network stability against adversarial attacks. Naturally, as the
imposed bound gets lower, the system becomes more robust. Although the difference in robustness
between similar values of the Lipschitz constant depends on the intrinsic structure of the dataset, our
results show that a good trade-off between robustness and performance can be achieved in all cases.

Figure 4: Accuracy vs. perturbation for different Lipschitz constants, ABBA Conv architectures, RPS
and CelebA.
Comparison with other defense strategies. In the following, we compare our method for
training robust models using ABBA networks with other defense strategies. Deel-Lip, developed in
[45], is a popular Lipschitz-based approach, which uses the spectral normalization of each layer to
offer robustness certificates during training. We have also made comparisons with another popular
technique to ensure 1-Lipschitz weights, via orthonormalization [2, 8]. For our experiments, we
trained models from scratch, using the implementation provided by [2]⁶, in order to enforce an
$\ell_2$-norm equal to 1 using Björck orthonormalization, while also preserving the gradient norm through
GroupSort. TRADES [59] introduces a robustness regularization term into the training objective.
This regularizer encourages the network to have similar predictions on both the original input and
its adversarial counterparts. On the other hand, Adversarial Training (AT) consists of augmenting the
training data with adversarial samples, increasing the network generalization capabilities to different
input alterations. However, this technique offers weak theoretical stability guarantees, as it is mainly
dependent on the strength of the adversary used during training.

⁶https://github.com/cemanil/LNets/tree/master

Figure 5: Comparisons with other defense techniques, Dense architectures.

For all experiments regarding AT, we used the Projected Gradient Descent (PGD) attack to generate
the adversarial samples with a perturbation level $\epsilon = 0.5$ and then employed the scheduling strategy
introduced by [32]. Concerning TRADES, we set $\lambda = 1$ for MNIST and FMNIST, and $\lambda = 1/2$ for the RPS
and CelebA datasets. For all the presented techniques, we considered the baseline equivalent to each
ABBA network.

Comparisons, in the same adversarial set-up as before, are depicted in Figures 5, 6, and 7. We
observe that using our theoretically certified Lipschitz bound yields models which are generally more
robust than AT and TRADES. For simple datasets, such as MNIST and FMNIST, robust ABBA and
Deel-Lip models exhibit similar behavior for low-magnitude adversarial attacks, but as we increase the
maximum perturbation $\epsilon$, our method performs better. In the case of real-world datasets (RPS and
CelebA), ABBA models exhibit high robustness against all the tested attacks, showing that
our approach allows us to train neural models that achieve strong stability properties without losing their
generalization power.

Figure 6: Comparisons with other defense techniques, Conv architectures, MNIST and FMNIST.


Limitations. The main limitation of our method is that non-negative ABBA operators require
more parameters to meet the universal approximation conditions. More precisely, for a given depth
$m$ and numbers of neurons $(N_i)_{0\le i\le m}$ per layer, a network $\widetilde{T} \in \mathcal{N}_{m,\mathcal{A}}$ has the same number of inputs
and outputs as the standard feedforward network $T$ in Model 3.1. However, all the layers of $\widetilde{T}$, except the
first one, have twice as many inputs as those of $T$. Exact training times are reported in SM11.
Because of the ABBA structure of the weight matrices in (3.11), the maximum number of parameters
of $\widetilde{T}$ is $2\big(N_0^2 + \sum_{i=1}^{m} N_i(N_{i-1}+1) + N_m^2\big)$, while the number of parameters of $T$ is $\sum_{i=1}^{m} N_i(N_{i-1}+1)$.
By storing $W_1' = W_1 W_0 \in \mathbb{R}^{(2N_1)\times N_0}$ instead of $W_1$ and $W_0$ separately, the maximum number of
parameters is reduced to $2\big(\sum_{i=1}^{m} N_i(N_{i-1}+1) + N_m^2\big)$. Moreover, since the weights are non-negative,
the model does not necessarily require storing signed representations, so the memory space occupied
by $\widetilde{T}$ could also be reduced.
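
To give an order of magnitude for these parameter counts, the formulas can be evaluated on a small example. The sketch below is plain Python; the layer widths and the function names `standard_params` and `abba_params` are hypothetical and do not correspond to the architectures of Appendix SM9.

```python
def standard_params(widths):
    """Weights and biases of a standard network: sum_i N_i (N_{i-1} + 1)."""
    return sum(widths[i] * (widths[i - 1] + 1) for i in range(1, len(widths)))

def abba_params(widths, merge_first=False):
    """Maximum parameter count of the ABBA network built on the same widths.

    With merge_first=True, the product of the first ABBA matrix and the
    input layer is stored as a single matrix, which removes the 2 N_0^2 term.
    """
    n_first, n_last = widths[0], widths[-1]
    total = 2 * (standard_params(widths) + n_last**2)
    if not merge_first:
        total += 2 * n_first**2
    return total

widths = [784, 256, 128, 10]  # hypothetical (N_0, ..., N_m)
print(standard_params(widths))                # 235146
print(abba_params(widths))                    # 1699804
print(abba_params(widths, merge_first=True))  # 470492
```

In this example, merging the first two linear operators brings the count close to twice that of the standard network, since the remaining overhead only comes from the output layer.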

While our method does not provide any certification regarding the accuracy of the classifier in 
adversarial environments, it delivers a certified value for the Lipschitz constant of the network. 


7. Conclusions. In this paper, we introduce ABBA networks, a novel class of neural networks 
where the majority of weights are non-negative. We demonstrate that these networks are universal 
approximators, possessing all the expressive properties of conventional signed neural architectures. 
Additionally, we unveil their remarkable algebraic characteristics, enabling us to derive precise 
Lipschitz bounds for both fully-connected and convolutive operators. Leveraging these bounds, we 
construct robust neural networks suitable for various classification tasks. 


Figure 7: Comparisons with other defense techniques, Conv architectures, RPS and CelebA. Panels:
(a) RPS, DDN attack; (b) RPS, FMN attack; (c) RPS, DeepFool attack; (d) CelebA, DDN attack;
(e) CelebA, FMN attack; (f) CelebA, DeepFool attack. Each panel plots the accuracy [%] against the
maximum $\ell_2$ perturbation norm (from 0 to 4) for the Baseline, Orthonorm., AT, TRADES, DeelLip, and
ABBA-Conv models, with $\theta_m = 1.5$ for RPS and $\theta_m = 1.0$ for CelebA.


The main advantage of ABBA networks is that they enjoy all the properties of non-negative
networks (e.g., they are easier to interpret and less prone to overfitting), without suffering from
the shortcomings of standard ones (e.g., lack of expressivity). Additionally, we showed that ABBA
structures allow tight Lipschitz bounds to be estimated, without requiring the solution of an NP-hard
problem, as is the case for conventional neural networks.

For future research, it would be intriguing to explore the application of ABBA networks in 
regression problems, where controlling the Lipschitz constant may present more challenges. Also, 
extending our theoretical bounds to different structures, such as recurrent or attention-based networks, 
holds promise for further advancements. Moreover, it would be interesting to study if using Lipschitz- 
constrained ABBA neural networks can improve certified robustness strategies like GloRoNets [27]. 

Finally, we recognize the necessity of investigating the scalability of the proposed training method 
to deep architectures. One of the main hurdles in this endeavor is the increased number of parameters 
that deep ABBA architectures entail. 
Appendix A. Proof of Proposition 3.6. For every $i \in \{1,\ldots,m\}$, let $x_i = T_i(x_{i-1})$, where
$x_0 \in \mathbb{R}^{N_0}$ is an arbitrary input of network $T$ and $x_m \in \mathbb{R}^{N_m}$ its corresponding output. By using the
symmetry properties of the activation operators, for every $i \in \{1,\ldots,m\}$, we have

(A.1)  $x_i = R_i(W_i x_{i-1} + b_i) = R_i(W_i^+ x_{i-1} - W_i^- x_{i-1} + b_i),$
(A.2)  $-x_i = -R_i(W_i x_{i-1} + b_i) = R_i(W_i^- x_{i-1} - W_i^+ x_{i-1} - b_i + c_i) - d_i.$


By making use of notation (3.14), (A.1) and (A.2) can be rewritten more concisely as 


(A.3) (Vi € {1,...,m}) |] = Bi( a F bel ij p á n F A 


Let us define, for every i € {0,...,m}, 


oe Ti 


Altogether (3.16), (3.11), (3.15), and (A.3) yield 


| ~ z (M Wr] re 0 bi 
(vi € {1,...,m}) a; = Ri (i: 4 (Fit — FA ) + p B a 
(A.5) = Ri (Wizi +ù). 


This shows that, if (lise are given by (3.8), Zm = tse, (Zo). By using the forms of Wo and 


Wm+1 in (3.13), we deduce that 
(A.6) Li = Wm41m F bati = T (20). 


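Appendix B. Proof for Proposition 3.8. According to [53] [10, Proposition 5.5],

(B.1)  $\theta_m = \sup_{\Lambda_1 \in \mathcal{D}^{\{-1,1\}}_{2N_1}, \ldots, \Lambda_m \in \mathcal{D}^{\{-1,1\}}_{2N_m}} \|W_{m+1} \Lambda_m W_m \cdots \Lambda_1 W_1 W_0\|,$

where, for every $i \in \{1,\ldots,m\}$, $\mathcal{D}^{\{-1,1\}}_{2N_i}$ designates the space of diagonal matrices of size $(2N_i) \times (2N_i)$
with diagonal entries equal to $-1$ or $1$. For every $(\Lambda_1,\ldots,\Lambda_m) \in \mathcal{D}^{\{-1,1\}}_{2N_1} \times \cdots \times \mathcal{D}^{\{-1,1\}}_{2N_m}$,

(B.2)  $\|W_{m+1} \Lambda_m W_m \cdots \Lambda_1 W_1 W_0\| \le \|W_{m+1}\| \, \|W_m \Lambda_{m-1} W_{m-1} \cdots \Lambda_1 W_1\| \, \|W_0\|.$

On the other hand, for every $i \in \{1,\ldots,m\}$, $W_i \in [0,+\infty[^{(2N_i) \times (2N_{i-1})}$. It then follows from [10,
Proposition 5.10] that

(B.3)  $\sup_{\Lambda_1 \in \mathcal{D}^{\{-1,1\}}_{2N_1}, \ldots, \Lambda_{m-1} \in \mathcal{D}^{\{-1,1\}}_{2N_{m-1}}} \|W_m \Lambda_{m-1} W_{m-1} \cdots \Lambda_1 W_1\| = \|W_m W_{m-1} \cdots W_1\|.$

According to Proposition 3.3 (iii), $W_m W_{m-1} \cdots W_1 \in \mathcal{A}_{N_m,N_0}$. Since this matrix has nonnegative
elements, we deduce from Proposition 3.3 (vii) that

$\theta_m \le \|W_{m+1}\| \, \|W_m \cdots W_1\| \, \|W_0\|$
$\quad = \|W_{m+1}\| \, \|\mathfrak{S}(W_m \cdots W_1)\| \, \|W_0\|$
(B.4)  $\quad = \|W_{m+1}\| \, \|\mathfrak{S}(W_m) \cdots \mathfrak{S}(W_1)\| \, \|W_0\|,$

where the last equality is also a consequence of Proposition 3.3 (iii). This leads to the Lipschitz bound
in (3.19).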


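Appendix C. Proof of Theorem 4.1.
Before giving the proof of our main result, we will introduce a link between the Fourier and the
spatial representations for a nonnegative convolutional kernel.

Lemma C.1. Let $(c,c') \in (\mathbb{N} \setminus \{0\})^2$ and let

(C.1)  $(\forall n \in \mathbb{Z}^d) \quad H(n) = (h_{q,p}(n))_{1\le q\le c',\, 1\le p\le c} \in [0,+\infty[^{c' \times c},$

where, for every $p \in \{1,\ldots,c\}$ and $q \in \{1,\ldots,c'\}$, $h_{q,p} \in \ell^1(\mathbb{Z}^d)$⁷. Then, the Fourier transform $\widehat{H}$ of
$(H(n))_{n \in \mathbb{Z}^d}$ is such that

(C.2)  $\sup_{\nu \in [0,1]^d} \|\widehat{H}(\nu)\| = \Big\| \sum_{n \in \mathbb{Z}^d} H(n) \Big\|.$

⁷$\ell^1(\mathbb{Z}^d)$ is the space of summable $d$-dimensional sequences.

Proof. For every $\nu \in [0,1]^d$,

(C.3)  $\widehat{H}(\nu) = \sum_{n \in \mathbb{Z}^d} H(n) \exp(-\imath 2\pi n^\top \nu).$

For every $u = [u_1,\ldots,u_c]^\top \in \mathbb{C}^c$, by using the triangle inequality,

$\|\widehat{H}(\nu) u\|^2 = \sum_{q=1}^{c'} \Big| \sum_{p=1}^{c} \widehat{h}_{q,p}(\nu) u_p \Big|^2$
$\quad = \sum_{q=1}^{c'} \Big| \sum_{p=1}^{c} \sum_{n \in \mathbb{Z}^d} h_{q,p}(n) \exp(-\imath 2\pi n^\top \nu) u_p \Big|^2$
$\quad \le \sum_{q=1}^{c'} \Big( \sum_{p=1}^{c} \sum_{n \in \mathbb{Z}^d} h_{q,p}(n) |u_p| \Big)^2$
$\quad = \Big\| \sum_{n \in \mathbb{Z}^d} H(n) |u| \Big\|^2$
$\quad \le \Big\| \sum_{n \in \mathbb{Z}^d} H(n) \Big\|^2 \, \| |u| \|^2$
(C.4)  $\quad = \Big\| \sum_{n \in \mathbb{Z}^d} H(n) \Big\|^2 \, \|u\|^2,$

where $|u|$ denotes the vector of moduli of the components of vector $u$. This shows that

(C.5)  $\|\widehat{H}(\nu)\| \le \Big\| \sum_{n \in \mathbb{Z}^d} H(n) \Big\|$

and, consequently,

(C.6)  $\sup_{\nu \in [0,1]^d} \|\widehat{H}(\nu)\| \le \Big\| \sum_{n \in \mathbb{Z}^d} H(n) \Big\|.$

In addition, the upper bound is attained since

(C.7)  $\widehat{H}(0) = \sum_{n \in \mathbb{Z}^d} H(n). \qquad \square$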
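
Lemma C.1 lends itself to a quick numerical sanity check. The sketch below assumes NumPy, a one-dimensional setting ($d = 1$), and a random non-negative MIMO impulse response with finite support; the sizes and the helper name `freq_response` are hypothetical. It evaluates the frequency response on a grid and compares its largest spectral norm with the norm of the summed impulse response.

```python
import numpy as np

rng = np.random.default_rng(3)
taps, c_out, c_in = 5, 3, 2
h = rng.uniform(size=(taps, c_out, c_in))  # H(n) >= 0 for n = 0..taps-1

def freq_response(nu):
    """Evaluate sum_n H(n) exp(-i 2 pi n nu) for a scalar frequency nu."""
    phases = np.exp(-2j * np.pi * np.arange(taps) * nu)
    return np.tensordot(phases, h, axes=1)

grid = np.linspace(0.0, 1.0, 257)
norms = [np.linalg.norm(freq_response(nu), 2) for nu in grid]
bound = np.linalg.norm(h.sum(axis=0), 2)

print(max(norms), bound)
```

On this example, the maximum over the grid is reached at the zero frequency and matches the norm of the summed kernel, as stated in (C.2) and (C.7).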


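Next, we derive the proof for Theorem 4.1 in light of Lemma C.1.

Proof. $W$ being the impulse response of the MIMO filter with the frequency response given by
(SM4.1), it follows from Noble identities [52] that $\mathcal{W}$ is equivalent to a convolution with $W$ followed
by a decimation by a factor $\sigma_m$. Let $x \in \mathcal{H}_0$. Let the $\sigma_m$-polyphase representation of $x$ (resp. $W$) be
defined as

(C.8)  $(\forall j \in \mathbb{S}(\sigma_m))(\forall n \in \mathbb{Z}^d) \quad x^{(j)}(n) = x(\sigma_m n - j)$
(C.9)  (resp. $W^{(j)}(n) = W(\sigma_m n + j)$).

Then, as a result of multirate digital filtering, $y = \mathcal{W} x$ if and only if

(C.10)  $y = \sum_{j \in \mathbb{S}(\sigma_m)} W^{(j)} * x^{(j)}.$

This sum of MIMO convolutions can be reformulated as a single one

(C.11)  $y = H * e,$

where $H$ is the $\zeta_m \times \sigma_m^d \zeta_0$ MIMO impulse response obtained by stacking rowwise the polyphase
MIMO impulse responses $(W^{(j)})_{j \in \mathbb{S}(\sigma_m)}$ and $e$ is the $\sigma_m^d \zeta_0$-component $d$-dimensional signal obtained
by stacking columnwise the polyphase signal components $(x^{(j)})_{j \in \mathbb{S}(\sigma_m)}$. For example, if $d = 2$, we
have

(C.12)  $(\forall n \in \mathbb{Z}^2) \quad H(n) = [W^{(0,0)}(n), \ldots, W^{(0,\sigma_m-1)}(n), \ldots, W^{(\sigma_m-1,0)}(n), \ldots, W^{(\sigma_m-1,\sigma_m-1)}(n)]$

and

(C.13)  $(\forall n \in \mathbb{Z}^2) \quad e(n) = \begin{bmatrix} x^{(0,0)}(n) \\ \vdots \\ x^{(0,\sigma_m-1)}(n) \\ \vdots \\ x^{(\sigma_m-1,\sigma_m-1)}(n) \end{bmatrix}.$

Note that, according to (C.8),

$\|e\|^2 = \sum_{n \in \mathbb{Z}^d} \|e(n)\|^2$
$\quad = \sum_{n \in \mathbb{Z}^d} \sum_{j \in \mathbb{S}(\sigma_m)} \|x^{(j)}(n)\|^2$
$\quad = \sum_{n \in \mathbb{Z}^d} \|x(n)\|^2$
(C.14)  $\quad = \|x\|^2.$

This equality and (C.11) imply that

(C.15)  $\|\mathcal{W}\| = \sup_{\nu \in [0,1]^d} \|\widehat{H}(\nu)\|.$

We thus deduce from Lemma C.1 that

(C.16)  $\|\mathcal{W}\| = \Big\| \sum_{n \in \mathbb{Z}^d} H(n) \Big\|.$

On the other hand, by using (C.9),

$\Big\| \sum_{n \in \mathbb{Z}^d} H(n) \Big\| = \Big\| \Big( \sum_{n \in \mathbb{Z}^d} H(n) \Big) \Big( \sum_{n \in \mathbb{Z}^d} H(n) \Big)^\top \Big\|^{1/2}$
$\quad = \Big\| \sum_{j \in \mathbb{S}(\sigma_m)} \Big( \sum_{n \in \mathbb{Z}^d} W^{(j)}(n) \Big) \Big( \sum_{n \in \mathbb{Z}^d} W^{(j)}(n) \Big)^\top \Big\|^{1/2}$
(C.17)  $\quad = \Big\| \sum_{j \in \mathbb{S}(\sigma_m)} \overline{W}^{(j)} (\overline{W}^{(j)})^\top \Big\|^{1/2},$

where $\overline{W}^{(j)} = \sum_{n \in \mathbb{Z}^d} W(\sigma_m n + j)$. $\qquad \square$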


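Appendix D. Proof of Theorem 4.2.

This result is a consequence of Theorem 4.1, which provides a Lipschitz bound for nonnegative
convolutional neural networks. By following a similar reasoning to Appendix B, a Lipschitz constant
of the ABBA network is

(D.1)  $\theta_m = \|\mathcal{W}_{m+1}\| \, \|\mathcal{W}_m \circ \cdots \circ \mathcal{W}_1\| \, \|\mathcal{W}_0\|.$

Let

(D.2)  $W = (W_m)_{\uparrow \sigma_{m-1}} * \cdots * (W_2)_{\uparrow \sigma_1} * W_1$

and let

(D.3)  $(\forall j \in \mathbb{S}(\sigma_m)) \quad \Omega^{(j)} = \sum_{n \in \mathbb{Z}^d} \mathfrak{S}\big(W(\sigma_m n + j)\big) \in [0,+\infty[^{\zeta_m \times \zeta_0}.$

Since $(\mathcal{W}_i)_{1\le i\le m}$ are convolutional operators with nonnegative kernels, it follows from Theorem 4.1
that

(D.4)  $\theta_m = \|\mathcal{W}_{m+1}\| \, \Big\| \sum_{j \in \mathbb{S}(\sigma_m)} \overline{W}^{(j)} (\overline{W}^{(j)})^\top \Big\|^{1/2} \|\mathcal{W}_0\|,$

where, for every $j \in \mathbb{S}(\sigma_m)$,

(D.5)  $\overline{W}^{(j)} = \sum_{n \in \mathbb{Z}^d} W(\sigma_m n + j) \in [0,+\infty[^{(2\zeta_m) \times (2\zeta_0)}.$

On the other hand, for every $n \in \mathbb{Z}^d$, $W_i(n)$ is an ABBA matrix. Since $W(n)$ is obtained by
multiplication and addition of such matrices, it follows from Proposition 3.3(ii) and (iii) that it is
also an ABBA matrix. We deduce that, for every $j \in \mathbb{S}(\sigma_m)$, $\overline{W}^{(j)}$ is an ABBA matrix and, by using
Proposition 3.3 (i), $\overline{W}^{(j)} (\overline{W}^{(j)})^\top$ is ABBA. By invoking now Proposition 3.3(i)-(iii) and (vii), we
deduce that

$\Big\| \sum_{j \in \mathbb{S}(\sigma_m)} \overline{W}^{(j)} (\overline{W}^{(j)})^\top \Big\| = \Big\| \mathfrak{S}\Big( \sum_{j \in \mathbb{S}(\sigma_m)} \overline{W}^{(j)} (\overline{W}^{(j)})^\top \Big) \Big\|$
$\quad = \Big\| \sum_{j \in \mathbb{S}(\sigma_m)} \mathfrak{S}(\overline{W}^{(j)}) \, \mathfrak{S}(\overline{W}^{(j)})^\top \Big\|$
(D.6)  $\quad = \Big\| \sum_{j \in \mathbb{S}(\sigma_m)} \Omega^{(j)} (\Omega^{(j)})^\top \Big\|.$

This shows that a Lipschitz constant of the ABBA network $T$ is

(D.7)  $\theta_m = \|\mathcal{W}_{m+1}\| \, \Big\| \sum_{j \in \mathbb{S}(\sigma_m)} \Omega^{(j)} (\Omega^{(j)})^\top \Big\|^{1/2} \|\mathcal{W}_0\|.$

Similarly to the derivation of (4.12), we deduce that $\theta_m \le \overline{\theta}_m$, where $\overline{\theta}_m$ is given by (4.15).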
SUPPLEMENTARY MATERIALS: ABBA Neural Networks: Coping with
Positivity, Expressivity, and Robustness*
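SM1. Symmetric activation functions. In practice, the activation operator $R_i$ is often
separable, that is, it operates componentwise:

(SM1.1)  $(\forall x = (\xi_k)_{1\le k\le N_i} \in \mathbb{R}^{N_i}) \quad R_i x = (\rho_i(\xi_k))_{1\le k\le N_i},$

where $\rho_i \colon \mathbb{R} \to \mathbb{R}$ is applied to every component $k \in \{1,\ldots,N_i\}$. Examples of odd functions allowing us to define
symmetric separable activation operators $R_i$ with $c_i = d_i = 0$ are

• the hyperbolic tangent activation function $\rho_i = \tanh$
• the arctangent activation function $\rho_i = (2/\pi) \arctan$
• the inverse square root unit function $\rho_i \colon \mathbb{R} \to \mathbb{R} \colon \xi \mapsto \xi/\sqrt{1+\xi^2}$
• the Elliot activation function $\rho_i \colon \mathbb{R} \to \mathbb{R} \colon \xi \mapsto \xi/(1+|\xi|)$.

Some examples of separable activation operators which are non-odd are described below. The
capped ReLU function is given by

(SM1.2)  $(\forall \xi \in \mathbb{R}) \quad \rho_i(\xi) = \begin{cases} 0 & \text{if } \xi < 0 \\ \xi & \text{if } 0 \le \xi < \chi \\ \chi & \text{otherwise,} \end{cases}$

where $\chi \in ]0,+\infty[$. We have then $c_i = d_i = \chi \mathbf{1}_{N_i}$ with $\mathbf{1}_{N_i} = [1,\ldots,1]^\top \in \mathbb{R}^{N_i}$. We can also
define a leaky version of this function as

(SM1.3)  $(\forall \xi \in \mathbb{R}) \quad \rho_i(\xi) = \begin{cases} \alpha \xi & \text{if } \xi < 0, \\ \xi & \text{if } 0 \le \xi < \chi, \\ \alpha(\xi - \chi) + \chi & \text{otherwise,} \end{cases}$

where $\alpha \in ]0,1[$ and $\chi \in ]0,+\infty[$ are hyper-parameters.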
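
The symmetry of these non-odd activations can be checked directly. The sketch below is plain Python and assumes the scalar form of the symmetry relation used in (A.2), namely that the negated activation of a value equals the activation of the cap minus that value, shifted down by the cap, with both constants equal to the cap. The choice of these constants for the leaky variant is an assumption of this sketch, verified numerically here, and the function names and test values are hypothetical.

```python
def capped_relu(xi, chi):
    """Capped ReLU of (SM1.2)."""
    return min(max(xi, 0.0), chi)

def leaky_capped_relu(xi, chi, alpha):
    """Leaky capped ReLU of (SM1.3)."""
    if xi < 0:
        return alpha * xi
    if xi < chi:
        return xi
    return alpha * (xi - chi) + chi

chi, alpha = 2.0, 0.25
points = [-3.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.5]

# Symmetry check with c = d = chi: -rho(xi) == rho(chi - xi) - chi.
capped_ok = all(
    abs(-capped_relu(xi, chi) - (capped_relu(chi - xi, chi) - chi)) < 1e-12
    for xi in points
)
leaky_ok = all(
    abs(-leaky_capped_relu(xi, chi, alpha)
        - (leaky_capped_relu(chi - xi, chi, alpha) - chi)) < 1e-12
    for xi in points
)
print(capped_ok, leaky_ok)
```

Geometrically, the relation says that the graph of each function is symmetric with respect to the point located at half the cap on both axes.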


SM2. Proof of the properties of ABBA matrices. (i)-(iii): These properties follow from
basic algebra. We will just detail the proof of the third one. Let

(SM2.1)  $M_1 = \begin{bmatrix} A_1 & B_1 \\ B_1 & A_1 \end{bmatrix}, \quad M_2 = \begin{bmatrix} A_2 & B_2 \\ B_2 & A_2 \end{bmatrix},$

where $(A_1, B_1) \in (\mathbb{R}^{N_2 \times N_1})^2$ and $(A_2, B_2) \in (\mathbb{R}^{N_3 \times N_2})^2$. Then

(SM2.2)  $M_2 M_1 = \begin{bmatrix} A_2 A_1 + B_2 B_1 & A_2 B_1 + B_2 A_1 \\ A_2 B_1 + B_2 A_1 & A_2 A_1 + B_2 B_1 \end{bmatrix} \in \mathcal{A}_{N_3,N_1}.$

In addition,

$\mathfrak{S}(M_2 M_1) = A_2 A_1 + B_2 B_1 + A_2 B_1 + B_2 A_1$
$\quad = (A_2 + B_2)(A_1 + B_1)$
(SM2.3)  $\quad = \mathfrak{S}(M_2)\,\mathfrak{S}(M_1).$

(iv): This property is a direct consequence of (ii) and (iii).
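(v): Let $M = \begin{bmatrix} A & B \\ B & A \end{bmatrix} \in \mathbb{R}^{(2N) \times (2N)}$. $\lambda \in \mathbb{C}$ is an eigenvalue of $M$ if and only if

(SM2.4)  $\det(M - \lambda \mathrm{Id}) = 0 \quad \Leftrightarrow \quad \det\Big( \begin{bmatrix} A - \lambda \mathrm{Id} & B \\ B & A - \lambda \mathrm{Id} \end{bmatrix} \Big) = 0.$

We have

(SM2.5)  $\begin{bmatrix} A - \lambda \mathrm{Id} & B \\ B & A - \lambda \mathrm{Id} \end{bmatrix} \begin{bmatrix} \mathrm{Id} & -\mathrm{Id} \\ \mathrm{Id} & \mathrm{Id} \end{bmatrix} = \begin{bmatrix} A + B - \lambda \mathrm{Id} & -A + B + \lambda \mathrm{Id} \\ A + B - \lambda \mathrm{Id} & A - B - \lambda \mathrm{Id} \end{bmatrix}.$

Since $A - B - \lambda \mathrm{Id}$ and $-A + B + \lambda \mathrm{Id}$ commute, we have [SM5]

(SM2.6)  $\det\Big( \begin{bmatrix} A + B - \lambda \mathrm{Id} & -A + B + \lambda \mathrm{Id} \\ A + B - \lambda \mathrm{Id} & A - B - \lambda \mathrm{Id} \end{bmatrix} \Big) = 2^N \det\big( (A + B - \lambda \mathrm{Id})(A - B - \lambda \mathrm{Id}) \big).$

Similarly,

(SM2.7)  $\det\Big( \begin{bmatrix} \mathrm{Id} & -\mathrm{Id} \\ \mathrm{Id} & \mathrm{Id} \end{bmatrix} \Big) = 2^N.$

We deduce from (SM2.5) that

$\det\Big( \begin{bmatrix} A - \lambda \mathrm{Id} & B \\ B & A - \lambda \mathrm{Id} \end{bmatrix} \Big) = \det\big( (A + B - \lambda \mathrm{Id})(A - B - \lambda \mathrm{Id}) \big)$
(SM2.8)  $\Leftrightarrow \quad \det(M - \lambda \mathrm{Id}) = \det(A + B - \lambda \mathrm{Id}) \det(A - B - \lambda \mathrm{Id}).$

So $\lambda$ is an eigenvalue of $M$ if and only if $\det(A + B - \lambda \mathrm{Id}) = 0$ or $\det(A - B - \lambda \mathrm{Id}) = 0$, i.e.,
$\lambda$ is an eigenvalue of $A + B$ or $A - B$.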
(vi): Let $M$ be defined similarly to previously with $(A, B) \in (\mathbb{R}^{N_2 \times N_1})^2$. We have

(SM2.9)  $\|M\| = \|M M^\top\|^{1/2} = \Big\| \begin{bmatrix} A A^\top + B B^\top & A B^\top + B A^\top \\ A B^\top + B A^\top & A A^\top + B B^\top \end{bmatrix} \Big\|^{1/2}.$

According to (v), the eigenvalues of $M M^\top \in \mathcal{A}_{N_2,N_2}$ are those of $A A^\top + B B^\top + A B^\top + B A^\top =
(A + B)(A + B)^\top$ and $A A^\top + B B^\top - A B^\top - B A^\top = (A - B)(A - B)^\top$. The maximum
eigenvalues of the two latter matrices are $\|A + B\|^2$ and $\|A - B\|^2$, respectively. Therefore
$\|M\| = \max\{\|A + B\|, \|A - B\|\}$.
(vii): In addition, if $A$ and $B$ have nonnegative elements,

$\|A - B\| = \sup_{x \in \mathbb{R}^{N_1} \setminus \{0\}} \frac{\|A x - B x\|}{\|x\|}$
$\quad \le \sup_{x \in \mathbb{R}^{N_1} \setminus \{0\}} \frac{\|A |x| + B |x|\|}{\|x\|}$
$\quad = \sup_{x \in [0,+\infty[^{N_1} \setminus \{0\}} \frac{\|A x + B x\|}{\|x\|}$
(SM2.10)  $\quad \le \|A + B\|,$

where $|x|$ denotes the vector whose components are the absolute values of those of vector $x$.
We deduce from (vi) that $\|M\| = \|A + B\| = \|\mathfrak{S}(M)\|$.
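(viii): We have

(SM2.11)  $A + B = \sum_{k=1}^{K} \lambda_k u_k v_k^\top,$
(SM2.12)  $A - B = \sum_{k=1}^{K} \mu_k t_k w_k^\top.$

Thus

(SM2.13)  $A = \frac{1}{2} \sum_{k=1}^{K} \big( \lambda_k u_k v_k^\top + \mu_k t_k w_k^\top \big),$
(SM2.14)  $B = \frac{1}{2} \sum_{k=1}^{K} \big( \lambda_k u_k v_k^\top - \mu_k t_k w_k^\top \big),$

and we deduce that

(SM2.15)  $\begin{bmatrix} A & B \\ B & A \end{bmatrix} = \frac{1}{2} \sum_{k=1}^{K} \Big( \lambda_k \begin{bmatrix} u_k \\ u_k \end{bmatrix} \begin{bmatrix} v_k \\ v_k \end{bmatrix}^\top + \mu_k \begin{bmatrix} t_k \\ -t_k \end{bmatrix} \begin{bmatrix} w_k \\ -w_k \end{bmatrix}^\top \Big).$

In addition, for every $(k,\ell) \in \{1,\ldots,K\}^2$,

$\begin{bmatrix} u_k \\ u_k \end{bmatrix}^\top \begin{bmatrix} u_\ell \\ u_\ell \end{bmatrix} = \begin{cases} 2 & \text{if } k = \ell \\ 0 & \text{otherwise,} \end{cases}$
$\begin{bmatrix} t_k \\ -t_k \end{bmatrix}^\top \begin{bmatrix} t_\ell \\ -t_\ell \end{bmatrix} = \begin{cases} 2 & \text{if } k = \ell \\ 0 & \text{otherwise,} \end{cases}$
(SM2.16)  $\begin{bmatrix} u_k \\ u_k \end{bmatrix}^\top \begin{bmatrix} t_\ell \\ -t_\ell \end{bmatrix} = 0,$

which shows that $\big\{ \frac{1}{\sqrt{2}} \begin{bmatrix} u_k \\ u_k \end{bmatrix}, \frac{1}{\sqrt{2}} \begin{bmatrix} t_k \\ -t_k \end{bmatrix} \big\}_{1\le k\le K}$ is an orthonormal family of $\mathbb{R}^{2N_2}$. For similar
reasons, $\big\{ \frac{1}{\sqrt{2}} \begin{bmatrix} v_k \\ v_k \end{bmatrix}, \frac{1}{\sqrt{2}} \begin{bmatrix} w_k \\ -w_k \end{bmatrix} \big\}_{1\le k\le K}$ is an orthonormal family of $\mathbb{R}^{2N_1}$. This allows us to conclude
that (SM2.15) provides a singular value decomposition of $\begin{bmatrix} A & B \\ B & A \end{bmatrix}$.

(ix): The rank of $\begin{bmatrix} A & B \\ B & A \end{bmatrix}$ is equal to the number of its nonzero singular values. From the previous
result, it is thus equal to the number of nonzero singular values of $A + B$ plus that of $A - B$, that is
the sum of the ranks of matrices $A + B$ and $A - B$.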

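(x): The fact that the ABBA structure is kept by matrix mappings operating elementwise is
obvious. Let us thus focus on the case of spectral functions. By using the same notation as in
(viii), it follows from (SM2.15) that $f(M) = \begin{bmatrix} \widehat{A} & \widehat{B} \\ \widehat{B} & \widehat{A} \end{bmatrix}$, where

(SM2.17)  $\widehat{A} + \widehat{B} = \sum_{k=1}^{K} \varphi(\lambda_k) u_k v_k^\top,$
(SM2.18)  $\widehat{A} - \widehat{B} = \sum_{k=1}^{K} \varphi(\mu_k) t_k w_k^\top.$

(xi): By using the same notation as in (3.6), the best approximation of rank less than or equal
to $R$ to a matrix $M_0$ in $\mathbb{R}^{(2N_2) \times (2N_1)}$ is $f(M_0)$, where $f$ is given by (3.6) with

(SM2.19)  $(\forall \lambda \in \mathbb{R}) \quad \varphi(\lambda) = \begin{cases} \lambda & \text{if } \lambda \ge \lambda_{0,R} \\ 0 & \text{otherwise,} \end{cases}$

and $\lambda_{0,R}$ is the $R$-th singular value of $M_0$ when these are ordered by decreasing value:
$\lambda_{0,1} \ge \cdots \ge \lambda_{0,2K}$. It thus follows from (x) that if $M_0 \in \mathcal{A}_{N_2,N_1}$, then $f(M_0) \in \mathcal{A}_{N_2,N_1}$.

(xii): The projection onto the spectral ball of center 0 and radius $\rho \in ]0,+\infty[$ of a matrix
$M \in \mathcal{A}_{N_2,N_1}$ is given by (3.6) where

$(\forall \xi \in \mathbb{R}) \quad \varphi(\xi) = \min\{\xi, \rho\}.$

The result then follows from Property (x).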
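
Properties (vi), (vii), and (ix) are easy to confirm numerically. The sketch below assumes NumPy and random blocks of arbitrary size; it only illustrates the statements proved above on one random draw and is not a substitute for the proofs.

```python
import numpy as np

def spec(x):
    """Spectral norm (largest singular value)."""
    return np.linalg.norm(x, 2)

rng = np.random.default_rng(4)
a, b = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
m = np.block([[a, b], [b, a]])

# (vi): ||M|| = max(||A + B||, ||A - B||) for arbitrary-signed blocks.
general_ok = np.isclose(spec(m), max(spec(a + b), spec(a - b)))

# (vii): with non-negative blocks, ||M|| = ||A + B||.
a_pos, b_pos = np.abs(a), np.abs(b)
m_pos = np.block([[a_pos, b_pos], [b_pos, a_pos]])
nonneg_ok = np.isclose(spec(m_pos), spec(a_pos + b_pos))

# (ix): rank(M) = rank(A + B) + rank(A - B).
rank_ok = (np.linalg.matrix_rank(m)
           == np.linalg.matrix_rank(a + b) + np.linalg.matrix_rank(a - b))

print(general_ok, nonneg_ok, rank_ok)
```

In practice, property (vii) means that the spectral norm of a non-negative ABBA matrix can be computed from a matrix with half the number of rows and columns.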


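Remark SM2.1. The last result can be generalized as follows. Let $\psi \colon \mathbb{R} \to ]-\infty,+\infty]$ be
a lower-semicontinuous function, which is proper, even, and convex, and let

(SM2.20)  $g \colon \mathbb{R}^{(2N_2) \times (2N_1)} \to ]-\infty,+\infty] \colon M \mapsto \sum_{k=1}^{2K} \psi(\lambda_k),$

where $K = \min\{N_1, N_2\}$ and $(\lambda_k)_{1\le k\le 2K}$ are the singular values of $M$. The proximity operator
of $g$ at $M \in \mathbb{R}^{(2N_2) \times (2N_1)}$ is [SM1, Proposition 24.68]:

$\mathrm{prox}_g \colon M \mapsto \underset{P \in \mathbb{R}^{(2N_2) \times (2N_1)}}{\mathrm{argmin}} \; \frac{1}{2} \|P - M\|_{\mathrm{F}}^2 + g(P)$
(SM2.21)  $\quad = \sum_{k=1}^{2K} \mathrm{prox}_\psi(\lambda_k) \, \widetilde{u}_k \widetilde{v}_k^\top,$

where $\|\cdot\|_{\mathrm{F}}$ denotes the Frobenius norm and $(\widetilde{u}_k, \widetilde{v}_k)$ are the left and right singular vectors of $M$
associated with $\lambda_k$. It then follows from Property (x) that, if $M \in \mathcal{A}_{N_2,N_1}$,
then $\mathrm{prox}_g(M) \in \mathcal{A}_{N_2,N_1}$.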

SM3. Link between Conv layers and MIMO systems. To be rigorous, let us first define the space $\mathcal{H}_{i-1}$ (resp. $\mathcal{H}_i$) in which the signals $(x_p)_{1\le p\le \zeta_{i-1}}$ (resp. $(y_q)_{1\le q\le \zeta_i}$) used in (4.1) live. Typically, $\mathcal{H}_i$ is some finite-dimensional subspace of $(\ell^2(\mathbb{Z}^d))^{\zeta_i}$, where $\ell^2(\mathbb{Z}^d)$ denotes the space of square summable discrete $d$-dimensional fields. For the discrete convolution $*$ to be properly defined, the kernels $(w_{i,q,p})_{1\le p\le \zeta_{i-1},\,1\le q\le \zeta_i}$ are then assumed to be summable. In practice, this assumption is satisfied since these kernels are chosen with finite size.

For $x = (x(n))_{n\in\mathbb{Z}^d} \in \ell^2(\mathbb{Z}^d)$, the decimation operation $(\cdot)_{\downarrow s}$ returns the output signal

(SM3.1)  $(\forall n \in \mathbb{Z}^d)\quad y(n) = x(sn).$


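Eq. (4.1) defines a MIMO (multi-input multi-output) filter that can be re-expressed in matrix form as

(SM3.2)  $(\forall n \in \mathbb{Z}^d)\quad u(n) = \sum_{n'\in\mathbb{Z}^d} W_i(n')\,x(n-n') = (W_i * x)(n),$

where

(SM3.3)  $u(n) = \begin{bmatrix} u_1(n) \\ \vdots \\ u_{\zeta_i}(n) \end{bmatrix} \in \mathbb{R}^{\zeta_i}, \qquad x(n) = \begin{bmatrix} x_1(n) \\ \vdots \\ x_{\zeta_{i-1}}(n) \end{bmatrix} \in \mathbb{R}^{\zeta_{i-1}},$

and $W_i(n)$ is given by (4.4). $(W_i(n))_{n\in\mathbb{Z}^d}$ defines the so-called MIMO impulse response of $W_i$. The MIMO impulse response of an ABBA layer is similarly given by (4.5).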


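These relations can also be written more concisely in the $d$-dimensional frequency domain¹:

(SM3.4)  $(\forall \nu \in [0,1]^d)\quad \widehat{u}(\nu) = \widehat{W}_i(\nu)\,\widehat{x}(\nu),$

where

(SM3.5)  $\widehat{x}(\nu) = \sum_{n\in\mathbb{Z}^d} x(n)\exp(-\imath 2\pi n^\top \nu) \in \mathbb{C}^{\zeta_{i-1}},$

(SM3.6)  $\widehat{W}_i(\nu) = \sum_{n\in\mathbb{Z}^d} W_i(n)\exp(-\imath 2\pi n^\top \nu) \in \mathbb{C}^{\zeta_i\times\zeta_{i-1}},$

and $\widehat{W}_i$ is the frequency response of the associated MIMO filter. Note that $\int_{[0,1]^d} \|\widehat{x}(\nu)\|^2\, d\nu < +\infty$, whereas $\widehat{W}_i$ is a continuous (hence bounded) function on $[0,1]^d$. Another useful result from sampling theory [SM6] is that the Fourier transform of $y = (y_q)_{1\le q\le \zeta_i}$ in (4.1) is deduced from the Fourier transform of $u$ by the relation

(SM3.7)  $(\forall \nu \in [0,1]^d)\quad \widehat{y}(\nu) = \frac{1}{s_i^d} \sum_{j\in\mathbb{S}(s_i)} \widehat{u}\Big(\frac{\nu+j}{s_i}\Big),$

where

(SM3.8)  $(\forall \sigma \in \mathbb{N}\setminus\{0\})\quad \mathbb{S}(\sigma) = \{0,\ldots,\sigma-1\}^d.$

¹ Alternatively, we could use the $d$-dimensional z-transform since we are dealing with discrete-space signals.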

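It is also worth noting that the interpolation by a factor $s$ of $y$,

(SM3.9)  $v = y_{\uparrow s} \;\Leftrightarrow\; (\forall n \in \mathbb{Z}^d)\quad v(n) = \begin{cases} y\big(\frac{n}{s}\big) & \text{if } n \in s\mathbb{Z}^d \\ 0 & \text{otherwise,} \end{cases}$

translates into

(SM3.10)  $(\forall \nu \in [0,1]^d)\quad \widehat{v}(\nu) = \widehat{y}(s\nu)$

in the frequency domain.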
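The relations (SM3.4), (SM3.7) and (SM3.10) have direct discrete counterparts that can be checked with the FFT. The sketch below assumes a one-dimensional setting ($d = 1$) with periodic (circular) signals of length `N`, so that the frequencies are restricted to the grid $\nu = k/N$; all sizes and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
N, s = 24, 3                 # signal length and sampling factor (s divides N)
c_in, c_out, K = 2, 3, 4     # channel counts and kernel length (toy values)

x = rng.standard_normal((c_in, N))
W = rng.standard_normal((K, c_out, c_in))   # W[n] plays the role of W_i(n)

# (SM3.2) with circular indexing: u(n) = sum_{n'} W(n') x(n - n')
u = sum(W[n] @ np.roll(x, n, axis=1) for n in range(K))

# (SM3.4): u_hat = W_hat x_hat at the DFT frequencies nu = k / N
x_hat = np.fft.fft(x, axis=1)
u_hat = np.fft.fft(u, axis=1)
W_hat = np.fft.fft(W, n=N, axis=0)          # shape (N, c_out, c_in)
u_hat_pred = np.einsum('kqp,pk->qk', W_hat, x_hat)

# (SM3.1) and (SM3.7): decimation averages s shifted copies of the spectrum
y = u[:, ::s]
y_hat = np.fft.fft(y, axis=1)
y_hat_pred = u_hat.reshape(c_out, s, N // s).mean(axis=1)

# (SM3.9) and (SM3.10): zero insertion repeats the spectrum s times
v = np.zeros((c_out, N))
v[:, ::s] = y
v_hat = np.fft.fft(v, axis=1)
v_hat_pred = np.tile(y_hat, (1, s))

print(np.allclose(u_hat, u_hat_pred),
      np.allclose(y_hat, y_hat_pred),
      np.allclose(v_hat, v_hat_pred))
```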


SM4. Frequency expressions of Lipschitz bounds. In this appendix, we establish frequency-based bounds of the Lipschitz constant of an $m$-layer convolutional neural network $T$.

Based on the MIMO concepts introduced in section SM3, we introduce the following global frequency response of the network:

(SM4.1)  $(\forall \nu \in [0,1]^d)\quad \widehat{W}(\nu) = \widehat{W}_m(\sigma_{m-1}\nu)\cdots\widehat{W}_2(\sigma_1\nu)\,\widehat{W}_1(\nu) \in \mathbb{C}^{\zeta_m\times\zeta_0},$

where $\widehat{W}_i$ is the frequency response associated with filter $W_i$ (see (SM3.6)). We then have the following result providing a frequency formula for evaluating the Lipschitz constant of a convolutional network.


Proposition SM4.1. The quantity

(SM4.2)  $\theta_m = \frac{1}{\sigma_m^{d/2}} \sup_{\nu\in[0,1/\sigma_m]^d} \Big\| \sum_{j\in\mathbb{S}(\sigma_m)} \widehat{W}\Big(\nu+\frac{j}{\sigma_m}\Big)\, \widehat{W}\Big(\nu+\frac{j}{\sigma_m}\Big)^{\mathsf{H}} \Big\|^{1/2}$ ²

provides a lower bound on the Lipschitz constant estimate of network $T$. In addition, if for every $i \in \{1,\ldots,m\}$, $p \in \{1,\ldots,\zeta_{i-1}\}$, and $q \in \{1,\ldots,\zeta_i\}$, $w_{i,q,p} = (w_{i,q,p}(n))_{n\in\mathbb{Z}^d}$ is a nonnegative kernel, i.e.,

(SM4.3)  $(\forall n \in \mathbb{Z}^d)\quad w_{i,q,p}(n) \ge 0,$

then $\theta_m$ is a Lipschitz constant of $T$.

² $(\cdot)^{\mathsf{H}}$ denotes the Hermitian transpose operation.

Proof. In the considered case, all activation operators are nonexpansive and they are assumed separable, except maybe at the last layer. Thus $T$ is a special case of the networks investigated in [SM3, Section 5] for which a tight estimate of the Lipschitz constant was provided. It then follows from [SM3, Theorem 5.2] that a lower bound on this Lipschitz constant estimate is

(SM4.4)  $\theta_m = \|W_m \circ \cdots \circ W_1\|.$

In addition, under the additional assumption that all the kernels are nonnegative, $T$ is an instance of the positively weighted networks investigated in [SM3, Section 5.3] and it follows from [SM3, Proposition 5.10] that $\theta_m$ is then a Lipschitz constant of $T$.

So the problem is to calculate the norm of the linear operator $W = W_m \circ \cdots \circ W_1$. Each operator $W_i$ with $i \in \{1,\ldots,m\}$ is the composition of a $d$-dimensional MIMO filter with a decimator. It follows from Noble identities [SM6] that $W$ reduces to cascading a $\zeta_m \times \zeta_0$ MIMO filter with frequency response $\widehat{W}$ with a decimation of each output by a factor $\sigma_m$.
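More precisely, if $x \in \mathcal{H}_0$ is the input of this linear system and $y$ its output, we have in the frequency domain:

(SM4.5)  $(\forall \nu \in [0,1]^d)\quad \widehat{y}(\nu) = \frac{1}{\sigma_m^d} \sum_{j\in\mathbb{S}(\sigma_m)} \widehat{W}\Big(\frac{\nu+j}{\sigma_m}\Big)\,\widehat{x}\Big(\frac{\nu+j}{\sigma_m}\Big) = \frac{1}{\sigma_m^d}\, \widetilde{W}\Big(\frac{\nu}{\sigma_m}\Big)\,\widetilde{x}\Big(\frac{\nu}{\sigma_m}\Big),$

where $\widetilde{x}(\nu/\sigma_m)$ is a vector of dimension $\sigma_m^d\zeta_0$ where the vectors $\big(\widehat{x}((\nu+j)/\sigma_m)\big)_{j\in\mathbb{S}(\sigma_m)}$ are stacked columnwise and $\widetilde{W}(\nu/\sigma_m)$ is a $\zeta_m \times \sigma_m^d\zeta_0$ matrix where the matrices $\big(\widehat{W}((\nu+j)/\sigma_m)\big)_{j\in\mathbb{S}(\sigma_m)}$ are stacked rowwise. For example, when $d = 2$, we have, for every $\nu = (\nu_1,\nu_2) \in [0,1]^2$,

(SM4.6)  $\widetilde{x}(\nu) = \begin{bmatrix} \overline{x}(\nu_1,\nu_2) \\ \overline{x}\big(\nu_1,\nu_2+\frac{1}{\sigma_m}\big) \\ \vdots \\ \overline{x}\big(\nu_1,\nu_2+\frac{\sigma_m-1}{\sigma_m}\big) \end{bmatrix} \in \mathbb{C}^{\sigma_m^2\zeta_0},$

(SM4.7)  $\overline{x}(\nu) = \begin{bmatrix} \widehat{x}(\nu_1,\nu_2) \\ \widehat{x}\big(\nu_1+\frac{1}{\sigma_m},\nu_2\big) \\ \vdots \\ \widehat{x}\big(\nu_1+\frac{\sigma_m-1}{\sigma_m},\nu_2\big) \end{bmatrix} \in \mathbb{C}^{\sigma_m\zeta_0},$

(SM4.8)  $\widetilde{W}(\nu) = \Big[\, \overline{W}(\nu_1,\nu_2) \;\; \overline{W}\big(\nu_1,\nu_2+\tfrac{1}{\sigma_m}\big) \;\cdots\; \overline{W}\big(\nu_1,\nu_2+\tfrac{\sigma_m-1}{\sigma_m}\big) \Big] \in \mathbb{C}^{\zeta_m\times\sigma_m^2\zeta_0},$

(SM4.9)  $\overline{W}(\nu) = \Big[\, \widehat{W}(\nu_1,\nu_2) \;\; \widehat{W}\big(\nu_1+\tfrac{1}{\sigma_m},\nu_2\big) \;\cdots\; \widehat{W}\big(\nu_1+\tfrac{\sigma_m-1}{\sigma_m},\nu_2\big) \Big] \in \mathbb{C}^{\zeta_m\times\sigma_m\zeta_0}.$

By using now Parseval's formula,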


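$\begin{aligned}
\|y\|^2 &= \int_{[0,1]^d} \|\widehat{y}(\nu)\|^2\, d\nu \\
&= \frac{1}{\sigma_m^{2d}} \int_{[0,1]^d} \Big\|\widetilde{W}\Big(\frac{\nu}{\sigma_m}\Big)\,\widetilde{x}\Big(\frac{\nu}{\sigma_m}\Big)\Big\|^2 d\nu \\
&\le \frac{1}{\sigma_m^{d}} \int_{[0,1/\sigma_m]^d} \|\widetilde{W}(\nu)\|^2\, \|\widetilde{x}(\nu)\|^2\, d\nu \\
&\le \frac{1}{\sigma_m^{d}} \sup_{\nu\in[0,1/\sigma_m]^d} \|\widetilde{W}(\nu)\|^2 \int_{[0,1/\sigma_m]^d} \|\widetilde{x}(\nu)\|^2\, d\nu \\
&= \frac{1}{\sigma_m^{d}} \sup_{\nu\in[0,1/\sigma_m]^d} \|\widetilde{W}(\nu)\|^2 \sum_{j\in\mathbb{S}(\sigma_m)} \int_{[0,1/\sigma_m]^d} \Big\|\widehat{x}\Big(\nu+\frac{j}{\sigma_m}\Big)\Big\|^2 d\nu \\
&= \frac{1}{\sigma_m^{d}} \sup_{\nu\in[0,1/\sigma_m]^d} \|\widetilde{W}(\nu)\|^2 \int_{[0,1]^d} \|\widehat{x}(\nu)\|^2\, d\nu \\
&= \frac{1}{\sigma_m^{d}} \sup_{\nu\in[0,1/\sigma_m]^d} \|\widetilde{W}(\nu)\|^2\, \|x\|^2. \qquad \text{(SM4.10)}
\end{aligned}$

This shows that

(SM4.11)  $\theta_m^2 \le \frac{1}{\sigma_m^{d}} \sup_{\nu\in[0,1/\sigma_m]^d} \|\widetilde{W}(\nu)\|^2.$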
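On the other hand, since $\widehat{W}$ is continuous, $\widetilde{W}$ is also continuous, and there exists $\widehat{\nu} \in [0,1/\sigma_m]^d$ such that

(SM4.12)  $\sup_{\nu\in[0,1/\sigma_m]^d} \|\widetilde{W}(\nu)\| = \|\widetilde{W}(\widehat{\nu})\|.$

Let us now choose, for every $\nu \in [0,1/\sigma_m]^d$, $\widetilde{x}(\nu) = \alpha_\epsilon(\nu)\,u(\nu)$ where $u(\nu)$ is a unit norm eigenvector associated with the maximum eigenvalue of $\widetilde{W}(\nu)^{\mathsf{H}}\widetilde{W}(\nu)$, $\epsilon \in\, ]0,+\infty[$, and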

(SM413) nv) = (ae ECVE {-1,0,1}9 fot -Pl < § 
0 otherwise. 


TS 


Then we see that when $\epsilon \to 0$, the upper bound in (SM4.10) is reached. We conclude that

(SM4.14)  $\theta_m = \frac{1}{\sigma_m^{d/2}} \sup_{\nu\in[0,1/\sigma_m]^d} \|\widetilde{W}(\nu)\|.$

In addition, by using the relation between $\widetilde{W}$ and $\widehat{W}$ (i.e., (SM4.8) and (SM4.9) in the 2D case),

(SM4.15)  $\|\widetilde{W}(\nu)\|^2 = \|\widetilde{W}(\nu)\widetilde{W}(\nu)^{\mathsf{H}}\| = \Big\| \sum_{j\in\mathbb{S}(\sigma_m)} \widehat{W}\Big(\nu+\frac{j}{\sigma_m}\Big)\, \widehat{W}\Big(\nu+\frac{j}{\sigma_m}\Big)^{\mathsf{H}} \Big\|.$

Gathering the last two equalities yields (SM4.2). □

When there is no decimation, i.e. the strides $(s_i)_{1\le i\le m}$ are all equal to 1, (SM4.2) reduces to


(SM4.16)  $\theta_m = \sup_{\nu\in[0,1]^d} \|\widehat{W}_m(\nu)\cdots\widehat{W}_2(\nu)\,\widehat{W}_1(\nu)\|.$

We recall that the following upper bound holds [49]:

(SM4.17)  $\theta_m \le \overline{\theta}_m = \prod_{i=1}^{m} \|W_i\|.$

Applying our result to the one-layer case shows that, for every $i \in \{1,\ldots,m\}$,

(SM4.18)  $\|W_i\| = \frac{1}{s_i^{d/2}} \sup_{\nu\in[0,1/s_i]^d} \Big\| \sum_{j\in\mathbb{S}(s_i)} \widehat{W}_i\Big(\nu+\frac{j}{s_i}\Big)\, \widehat{W}_i\Big(\nu+\frac{j}{s_i}\Big)^{\mathsf{H}} \Big\|^{1/2}.$

Note that the resulting upper bound in (SM4.17) gives a loose estimate of the Lipschitz constant, which has, however, the merit of being valid for convolutional networks having kernels with an arbitrary sign.
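The frequency formula (SM4.2) and the separable bound (SM4.17) can be compared with a brute-force computation on a toy example. The sketch below is an illustration under explicit assumptions: $d = 1$, periodic (circular) signals of length `N`, two layers with strides 2 and 3, and random signed kernels. In this periodic setting the supremum only needs to be taken over the DFT grid $\nu = k/N$. The function names are hypothetical.

```python
import numpy as np

def layer(x, kernel, stride):
    """Circular MIMO convolution (SM3.2) followed by decimation (SM3.1)."""
    u = sum(kernel[n] @ np.roll(x, n, axis=1) for n in range(kernel.shape[0]))
    return u[:, ::stride]

def operator_matrix(kernels, strides, n_samples):
    """Matrix of W_m o ... o W_1 acting on flattened (channel, sample) inputs."""
    c0 = kernels[0].shape[2]
    cols = []
    for e in np.eye(c0 * n_samples):
        y = e.reshape(c0, n_samples)
        for kernel, stride in zip(kernels, strides):
            y = layer(y, kernel, stride)
        cols.append(y.ravel())
    return np.stack(cols, axis=1)

def freq_response(kernel, nu):
    """W_hat(nu) = sum_n W(n) exp(-2i*pi*n*nu), i.e. (SM3.6) with d = 1."""
    phase = np.exp(-2j * np.pi * np.arange(kernel.shape[0]) * nu)
    return np.tensordot(phase, kernel, axes=(0, 0))

def theta(kernels, strides, n_samples):
    """Evaluate (SM4.2) for d = 1 on the grid nu = k / n_samples."""
    sigma = int(np.prod(strides))
    best = 0.0
    for k in range(n_samples // sigma):
        gram = 0.0
        for j in range(sigma):
            nu = k / n_samples + j / sigma
            W, scale = np.eye(kernels[0].shape[2]), 1
            for kernel, stride in zip(kernels, strides):
                W = freq_response(kernel, scale * nu) @ W   # product in (SM4.1)
                scale *= stride
            gram = gram + W @ W.conj().T
        best = max(best, np.linalg.norm(gram, 2))
    return np.sqrt(best / sigma)

rng = np.random.default_rng(3)
N, strides = 24, (2, 3)
# kernels[i] has shape (taps, output channels, input channels)
kernels = [rng.standard_normal((3, 3, 2)), rng.standard_normal((3, 2, 3))]

norm_explicit = np.linalg.norm(operator_matrix(kernels, strides, N), 2)
norm_formula = theta(kernels, strides, N)
separable = (theta(kernels[:1], strides[:1], N)
             * theta(kernels[1:], strides[1:], N // strides[0]))
print(norm_explicit, norm_formula, separable)
```

The first two numbers agree up to rounding, which matches (SM4.4) and (SM4.2), while the third one, the product of the layer norms given by (SM4.18), is at least as large, as stated in (SM4.17).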


SM5. Numerical evaluation of the Lipschitz constant of nonnegative convolutional networks. We compare the tight bound $\theta_m$ in Theorem 4.1 with the separable one $\overline{\theta}_m$ given by (4.12) for a classic convolutional network using non-negative kernels. The results provided in Table SM1 correspond to the convolutive part of LeNet-5 [SM4]. In our experiments, we initialized the networks with randomly sampled weights drawn from a uniform distribution on [0,1]. Table SM1 shows the relative difference

$e_r = \frac{\overline{\theta}_m - \theta_m}{\overline{\theta}_m}$

for 10 distinct noise realizations. We thus observe that the difference between the two bounds is small. Similar observations can be made on various convolutive architectures. In contrast, for fully connected networks, a separable bound is usually overpessimistic.
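The relative differences can be recomputed from the two bounds reported in Table SM1. The short sketch below assumes that the difference is normalized by the separable bound, which is the convention that reproduces the reported percentages; the variable names are illustrative.

```python
# Values copied from Table SM1 (LeNet-5, 10 random positive initializations).
tight = [30302.73, 27734.91, 30298.73, 29374.35, 30180.16,
         28632.60, 30615.02, 30395.67, 34828.90, 30097.62]
separable = [30696.07, 28114.29, 30860.56, 29821.62, 30670.05,
             29298.64, 31152.06, 30866.87, 35220.36, 30367.71]
reported = [1.28, 1.35, 1.82, 1.50, 1.60, 2.27, 1.72, 1.53, 1.11, 0.89]

# Relative difference in percent, normalized by the separable bound.
relative = [100 * (ub - lb) / ub for lb, ub in zip(tight, separable)]
for run, (value, ref) in enumerate(zip(relative, reported), start=1):
    print(f"#{run}: {value:.2f}% (reported {ref:.2f}%)")
```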


Table SM1
Lipschitz bounds obtained for 10 independent realizations of random positive initialization for LeNet-5.

Realization   $\theta_m$   $\overline{\theta}_m$   $e_r$ [%]
#1            30302.73     30696.07                1.28
#2            27734.91     28114.29                1.35
#3            30298.73     30860.56                1.82
#4            29374.35     29821.62                1.50
#5            30180.16     30670.05                1.60
#6            28632.60     29298.64                2.27
#7            30615.02     31152.06                1.72
#8            30395.67     30866.87                1.53
#9            34828.90     35220.36                1.11
#10           30097.62     30367.71                0.89


SM6. Lipschitz constant of average pooling. We consider the case when the $i$-th layer is an average pooling where the average is computed on patches of length $L_i$ in each dimension and with stride $s_i$. For simplicity, we suppose that $L_i$ is a multiple of $s_i$. The number of input and output channels is then equal, i.e. $\zeta_i = \zeta_{i-1}$. The average is calculated on each channel independently; this operation is a special case of a nonnegative convolutional layer where, for every $n \in \mathbb{Z}^d$, $W_i(n)$ is a diagonal matrix. The diagonal elements of this matrix are

(SM6.1)  $(\forall p \in \{1,\ldots,\zeta_i\})(\forall n \in \mathbb{Z}^d)\quad w_{i,p,p}(n) = \begin{cases} \dfrac{1}{L_i^d} & \text{if } n \in \{0,\ldots,L_i-1\}^d \\ 0 & \text{otherwise.} \end{cases}$

We deduce that, for every $j \in \mathbb{S}(s_i)$, the matrix $W_i^{(j)}$ is also a diagonal matrix. More precisely, the sum in (4.13) can be restricted to values of $n \in \{0,\ldots,L_i/s_i-1\}^d$ and $W_i^{(j)} = \frac{1}{s_i^d}\,\mathrm{Id}$.

We deduce that the Lipschitz constant of the average pooling layer is

(SM6.2)  $\|W_i\| = \Big\| \sum_{j\in\mathbb{S}(s_i)} W_i^{(j)} \big(W_i^{(j)}\big)^\top \Big\|^{1/2} = \frac{1}{s_i^{d/2}}.$

We see that this constant is independent of the patch size and is a decreasing function of the stride.
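The value $s_i^{-d/2}$ can be confirmed by computing the spectral norm of an explicit pooling matrix. The sketch below is only an illustration: it assumes circular boundary handling on a signal of 24 samples, builds the one-dimensional operator directly, and obtains the separable two-dimensional one with a Kronecker product. The helper name `avg_pool_matrix` is hypothetical.

```python
import numpy as np

def avg_pool_matrix(n_samples, patch, stride):
    """1-D average pooling as a matrix, with circular boundary handling."""
    P = np.zeros((n_samples // stride, n_samples))
    for k in range(P.shape[0]):
        for t in range(patch):
            # output k averages the samples x(stride*k - t), t = 0..patch-1
            P[k, (stride * k - t) % n_samples] = 1.0 / patch
    return P

results = []
for patch, stride in [(2, 2), (4, 2), (6, 3)]:
    P1 = avg_pool_matrix(24, patch, stride)      # d = 1
    P2 = np.kron(P1, P1)                         # d = 2, separable pooling
    results.append((patch, stride,
                    np.linalg.norm(P1, 2), np.linalg.norm(P2, 2)))

for patch, stride, n1, n2 in results:
    print(f"L={patch}, s={stride}: d=1 -> {n1:.4f}, d=2 -> {n2:.4f}")
```

For a fixed stride, changing the patch length leaves the norm unchanged, while a larger stride lowers it, in line with (SM6.2).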


SM7. Expressivity of ABBA networks — simulations. For this experiment, we randomly 
sampled points from four distinct 2D Gaussian distributions, with different means and covari- 
ance matrices, totaling 125 2-dimensional points per class. Figure SM1 shows a comparison 
between decision boundaries resulting from training two models: a standard one trained 
conventionally and its non-negative ABBA equivalent. The two models reach a similar solution, 
showing that the theoretical properties proved in this paper are also observed in practical 
simulations. 


[Figure: two scatter plots with decision regions; panels (a) Standard model and (b) ABBA model]

Figure SM1. Decision space comparison between fitting an ABBA network and a standard arbitrary-signed one.

SM8. Constrained training of signed convolutional layers. The first and the last layers of an ABBA convolutional network have signed kernels. The norm of these layers is computed by using (SM4.18) and constrained to be less than $\overline{\theta}_i$ with $i \in \{0, m+1\}$. Note that (SM4.18) makes use of the frequency response $\widehat{W}_i$ of filter $W_i$. A discrete Fourier transform (DFT) is actually implemented (using 128 x 128 discrete frequencies). In the discrete frequency domain, the upper bound constraint is thus decomposed into $128^2$ matrix norm bounds obtained by summing over $s_i^2$ frequencies. The projection onto each of these elementary constraint sets is computed by truncating a singular value decomposition. An additional constraint, however, has to be addressed, which is related to the fact that the kernels are of finite size. This implicitly defines a linear constraint. Projecting onto the associated vector space is simply obtained by truncating the kernel (after inverse DFT) to the desired size. The set $\mathcal{S}_{i,t}$ is thus defined as the intersection of the former matrix norm constraint set and the latter vector space. Projecting onto this intersection can be achieved by an iterative convex optimization approach. In our case, we use a Douglas-Rachford algorithm [SM2].
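A simplified version of this projection can be sketched as follows. This is not the authors' implementation: it assumes a one-dimensional kernel, a stride equal to 1 (so that each DFT frequency gives its own matrix norm bound), 16 DFT frequencies, and a basic Douglas-Rachford iteration with hypothetical parameters `gamma` and `n_iter`. All function names are illustrative.

```python
import numpy as np

def project_norm(w, rho):
    """Clip the singular values of every DFT coefficient matrix to rho."""
    W = np.fft.fft(w, axis=0)                       # (frequencies, c_out, c_in)
    U, s, Vh = np.linalg.svd(W, full_matrices=False)
    W = (U * np.minimum(s, rho)[:, None, :]) @ Vh
    return np.fft.ifft(W, axis=0).real

def project_support(w, size):
    """Truncate the kernel: keep the first `size` taps, zero the others."""
    out = np.zeros_like(w)
    out[:size] = w[:size]
    return out

def douglas_rachford(z, size, rho, n_iter=300, gamma=1.0):
    """Approximate the projection of z onto {norm bound} ∩ {finite support}."""
    y = z.copy()
    for _ in range(n_iter):
        x = project_norm(y, rho)
        # prox of gamma * (indicator of the support set + 0.5 * ||. - z||^2)
        v = project_support((2 * x - y + gamma * z) / (1 + gamma), size)
        y = y + v - x
    return project_norm(y, rho), v

rng = np.random.default_rng(4)
size, rho = 3, 1.0
z = project_support(rng.standard_normal((16, 2, 2)), size)  # 3-tap 2x2 kernel
x, v = douglas_rachford(z, size, rho)

freq_norms = np.linalg.norm(np.fft.fft(x, axis=0), 2, axis=(1, 2))
print(freq_norms.max(), np.abs(x[size:]).max())
```

The first returned kernel satisfies the norm bound at every frequency, and the second one has exactly the prescribed support; the two get closer as the number of iterations grows.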


SM9. ABBA architectures. Table SM3 details the ABBA Dense and ABBA Conv architectures used for the MNIST and FMNIST datasets, while Table SM2 shows our choices for the RPS and CelebA datasets. As the ABBA layers have a specific form, their output size is twice the number of filters. The activation operator is the Capped Leaky ReLU (CLR) function defined in (SM1.3) for all Dense layers. For convolutional operators, we employed a 3 x 3 kernel, using the same activation.

We used the official train-test split provided by the TensorFlow framework for both MNIST and FMNIST datasets and did not employ any augmentation strategy during training. For RPS and CelebA models, we resized the input images to 150 x 150 and 128 x 128, respectively, before feeding them to the network. In the case of the CelebA dataset, we opted for a binary classification task on the bald feature. We extracted all the images containing the bald attribute, and we randomly selected the same number of examples from the non-bald class, in order to avoid class imbalance. Additional information regarding the optimization parameters used during training is provided in Table SM4.


Table SM2
ABBA Conv architectures details for RPS and CelebA datasets.

Layer type             RPS output            RPS stride   CelebA output         CelebA stride
Input                  150 x 150 x 3                      128 x 128 x 3
Conv2D                 150 x 150 x 8         1            128 x 128 x 8         1
ABBA Conv2D + CLR      -                     -            64 x 64 x 8 (x2)      2
ABBA Conv2D + CLR      75 x 75 x 32 (x2)     2            32 x 32 x 16 (x2)     2
ABBA Conv2D + CLR      37 x 37 x 64 (x2)     2            16 x 16 x 32 (x2)     2
ABBA Conv2D + CLR      18 x 18 x 128 (x2)    2            8 x 8 x 64 (x2)       2
Conv2D                 18 x 18 x 128         1            8 x 8 x 64            1
Global Max-Pooling2D   128 (x2)                           64 (x2)
ABBA Dense + CLR       128 (x2)                           -
ABBA Dense + CLR       64 (x2)                            -
ABBA Dense + CLR       32 (x2)                            -
Dense                  3                                  2


Table SM3
ABBA Dense and ABBA Conv architecture details for MNIST and FMNIST datasets. For convolutional layers the stride is set to 1.

ABBA Conv
Layer type             MNIST/FMNIST
Input                  28 x 28 x 1
Conv2D                 28 x 28 x 32
ABBA Conv2D + CLR      28 x 28 x 16
[illegible in source]
[illegible in source]  256
ABBA Dense + CLR       128
ABBA Dense + CLR       64
Dense                  10

ABBA Dense
Layer type             MNIST    FMNIST
[illegible in source]
ABBA Dense + CLR       128      128
ABBA Dense + CLR       64       64
[illegible in source]
[illegible in source]


Table SM4
Training hyperparameters.

Dataset   Optimizer        No. Epochs   Learning rate   Batch size
MNIST     projected ADAM   150          10^-3           1024
FMNIST    projected ADAM   200          10^-?           1024
RPS       projected ADAM   250          10^-4           64
CelebA    projected ADAM   100          10^-4           128


SM10. Adversarial examples. For all datasets, adversarial examples created by using the DDN attack are displayed in Figures SM2, SM3, SM4, and SM5. We generated adversarial samples using untargeted DDN attacks, with a budget of 300 iterations and initial parameters as proposed by the authors. We did not limit the maximum perturbation $\epsilon$, in order to find the minimum one allowing us to fool the model. It can be easily seen that, for DeelLip and ABBA networks, the perturbations required for misclassification are higher. In particular, we observe that the perturbations needed to fool ABBA networks lead to severe artifacts in the images.


SM11. Training time. We first compared the average time/epoch for training a standard 
network and its ABBA equivalent. Table SM5 reports the average seconds per epoch for both 
cases, for different feed-forward architectures. On average, training an ABBA neural network 
for 200 epochs on MNIST introduces less than 10% additional training time. 
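The overhead can be recomputed from the per-epoch times and model sizes of Table SM5. In the sketch below, a few entries whose decimal point was lost in the source (for instance 428, read as 4.28) are assumed to follow the same format as their neighbours; the variable names are illustrative.

```python
# Per-epoch times (seconds) and model sizes (MB) copied from Table SM5.
archs = ["2C2F", "2C3F", "3C2F", "4C2F", "4C3F", "5C1F"]
standard_sec = [4.25, 4.29, 4.31, 4.28, 4.37, 4.31]
abba_sec = [4.52, 4.61, 4.67, 4.68, 4.79, 4.74]
standard_mb = [0.09, 0.14, 0.39, 0.53, 0.59, 0.97]
abba_mb = [0.18, 0.28, 0.78, 1.06, 1.18, 1.94]

# Relative extra time per epoch (percent) and size ratio ABBA / Standard.
overhead = [100 * (a - s) / s for s, a in zip(standard_sec, abba_sec)]
size_ratio = [a / s for s, a in zip(standard_mb, abba_mb)]

for name, o, r in zip(archs, overhead, size_ratio):
    print(f"{name}: +{o:.2f}% time per epoch, size x{r:.2f}")
```

With these values, every architecture stays below 10% extra time per epoch, and the ABBA models are twice as large as their standard counterparts.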

The projection is a costly step, and it is the main contributor to the training overhead. A comparison of the per-batch training time with (green line) and without (dotted green line) projection is featured in Figures SM6a and SM6b for architectures with an increasing number of fully connected and convolutional layers, respectively. The average deviation from the imposed global bound (red), which was set to 1 in all cases, is also reported. This shows that we are able to maintain the imposed bounds, given the same number of iterations, irrespective of the network depth.


Figure SM2. Adversarial examples with DDN attack for Conv-Dense models, on MNIST dataset. The $\ell_2$ perturbation magnitude is given in the top-left corner.

Figure SM3. Adversarial examples with DDN attack for Conv-Dense models, on FMNIST dataset. The $\ell_2$ perturbation magnitude is given in the top-left corner.

[Figure columns: Clean, Baseline, AT, TRADES, DeelLip, ABBA]

Figure SM4. Adversarial examples generated with DDN, on RPS dataset. For each example, the first row shows the adversarial images and the second row the pixel differences between adversarial and clean samples.

Figure SM5. Adversarial examples with DeepFool attack for CelebA. The $\ell_2$ perturbation magnitude is given in the top-left corner.


Table SM5
Comparison of per-epoch training times for various Standard and ABBA architectures, on MNIST. XCYF corresponds to an architecture with X convolutional layers, followed by Y fully-connected layers.

Model      Metric       2C2F    2C3F    3C2F    4C2F    4C3F    5C1F
Standard   Acc [%]      95.28   95.80   99.18   99.30   99.26   99.10
           Sec./Epoch   4.25    4.29    4.31    4.28    4.37    4.31
           Size (MB)    0.09    0.14    0.39    0.53    0.59    0.97
ABBA       Acc [%]      95.54   95.34   98.62   99.12   99.14   98.72
           Sec./Epoch   4.52    4.61    4.67    4.68    4.79    4.74
           Size (MB)    0.18    0.28    0.78    1.06    1.18    1.94


Figure SM6. Computation time for the projection step for a variable-length sequence of ABBA (a) fully-connected and (b) convolutional layers. All projections were computed with 10 iterations, and the results were averaged over 50 independent simulations.